[GH-ISSUE #8430] Multi GPU, default GPU setting, specific model pin to specific GPU #67474

Open
opened 2026-05-04 10:29:14 -05:00 by GiteaMirror · 11 comments

Originally created by @Bashir-Rabbit on GitHub (Jan 15, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8430

I have a multi-GPU configuration with different GPU models and different memory sizes.
I wish:

  1. we could select a default GPU for all models (ideally the fastest one with the most memory)
  2. we could pin specific models to specific GPUs, so that small models go to the small-VRAM GPU and the fastest/highest-VRAM card is kept free for larger models (see the illustrative sketch below).
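
Purely as an illustration of the kind of control I mean (neither of these settings exists in Ollama today; they are hypothetical):

```console
# hypothetical settings, shown only to illustrate the request
$ OLLAMA_DEFAULT_GPU=0 ollama serve   # wish 1: a default GPU for all models
$ ollama run llama3.2 --gpu 1         # wish 2: pin a specific model to a specific GPU
```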
GiteaMirror added the feature request label 2026-05-04 10:29:14 -05:00

@rick-github commented on GitHub (Jan 15, 2025):

Related: https://github.com/ollama/ollama/issues/3902


@Bashir-Rabbit commented on GitHub (Jan 15, 2025):

Thanks,
I am going to buy a 5090 and I would like much more control over my GPUs so I can utilize them more efficiently.
Ollama and the whole deployment concept are amazing, and more and more people will use it for private AI agents instead of subscription-based AI solutions that carry the risk of data being leaked or misused.


@rick-github commented on GitHub (Jan 16, 2025):

It will probably be a while before #3902 sees any progress. In the meantime, you can sort of do this by running multiple ollama servers, binding a specific GPU to each server with CUDA_VISIBLE_DEVICES, and then running a proxy in front to distribute the queries. litellm (https://github.com/BerriAI/litellm) is an LLM proxy that can route queries based on the model name. If Docker is available, it's pretty easy to do a proof of concept. Here I've assumed 2 GPUs: GPU0 has lots of VRAM and will run big-model (actually qwen2.5:0.5b in ollama), and GPU1 is set as the default for all other models.

```yaml
x-ollama: &ollama
  image: ollama/ollama:${OLLAMA_DOCKER_TAG-latest}
  volumes:
    - ${OLLAMA_MODELS-./ollama}:/root/.ollama
  environment: &env
    OLLAMA_KEEP_ALIVE: ${OLLAMA_KEEP_ALIVE--1}
    OLLAMA_LOAD_TIMEOUT: ${OLLAMA_LOAD_TIMEOUT-5m}
    OLLAMA_MAX_LOADED_MODELS: 1
    OLLAMA_NUM_PARALLEL: 1
    OLLAMA_TMPDIR: /tmp/ollama
    OLLAMA_FLASH_ATTENTION: ${OLLAMA_FLASH_ATTENTION-0}
    OLLAMA_DEBUG: ${OLLAMA_DEBUG-1}
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]

services:
  ollama-1:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 0

  ollama-2:
    <<: *ollama
    environment:
      <<: *env
      CUDA_VISIBLE_DEVICES: 1
      OLLAMA_MAX_LOADED_MODELS: 4

  litellm:
    image: litellm-lb
    build:
      dockerfile_inline: |
        FROM litellm/litellm:${LITELLM_DOCKER_TAG-v1.58.2}
        RUN cat > /config.yaml <<EOF
        model_list:
          - model_name: big-model
            litellm_params:
              model: ollama/qwen2.5:0.5b
              api_base: http://ollama-1:11434
          - model_name: "*"
            litellm_params:
              model: "ollama/*"
              api_base: http://ollama-2:11434
        litellm_settings:
          drop_params: True
        EOF
        CMD [ "--config", "/config.yaml", "--port", "4040" ]
    ports:
      - 4040:4040
    depends_on:
      - ollama-1
      - ollama-2
```
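
Assuming Docker Compose v2, bringing the stack up and pulling the models looks something like this (both backends mount the same model store in this compose file, so a pull from either is visible to both):

```console
$ docker compose up -d
$ docker compose exec ollama-1 ollama pull qwen2.5:0.5b
$ docker compose exec ollama-2 ollama pull llama3.2
```

Then query the proxy: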
```console
$ curl -s localhost:4040/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"big-model","messages":[{"role":"user","content":"2+2?"}]}' | jq
{
  "id": "chatcmpl-7860f649-b125-4733-8722-ebdaaf70d968",
  "created": 1737047060,
  "model": "ollama/qwen2.5:0.5b",
  "object": "chat.completion",
  "system_fingerprint": null,
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "2 + 2 equals 4.\n\nThis is a simple arithmetic operation that involves two numbers (2 and 2), where we add them together to get the sum.",
        "role": "assistant",
        "tool_calls": null,
        "function_call": null
      }
    }
  ],
  "usage": {
    "completion_tokens": 35,
    "prompt_tokens": 36,
    "total_tokens": 71,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  }
}
$ curl -s localhost:4040/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama3.2","messages":[{"role":"user","content":"2+2?"}]}' | jq
{
  "id": "chatcmpl-14a252dd-564e-4b1a-9cc3-1e6269de28fc",
  "created": 1737047497,
  "model": "ollama/llama3.2",
  "object": "chat.completion",
  "system_fingerprint": null,
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The answer is: 4",
        "role": "assistant",
        "tool_calls": null,
        "function_call": null
      }
    }
  ],
  "usage": {
    "completion_tokens": 7,
    "prompt_tokens": 32,
    "total_tokens": 39,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  }
}
```

The restriction is that you have to use the OpenAI API to talk to the proxy, so some of the fancier ollama operations, like creating models, are not available.
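
If the native ollama API is needed, the same idea also works without Docker or the proxy: run two ollama servers pinned to different GPUs on different ports and point clients at whichever instance should serve a given model. A minimal sketch, assuming an NVIDIA setup and a recent Ollama:

```console
# terminal 1: big-VRAM GPU on the default port
$ CUDA_VISIBLE_DEVICES=0 ollama serve
# terminal 2: second GPU on an alternate port
$ CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve
# clients select an instance via OLLAMA_HOST
$ OLLAMA_HOST=127.0.0.1:11435 ollama run llama3.2
```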


@Bashir-Rabbit commented on GitHub (Jan 16, 2025):

Thanks. Running multiple docker containers and using a proxy is exactly what I wanted to avoid.
If this is the only way, it is still a solution, and I will test it as soon as I get the 5090.

Again, many thanks for the help; I will let you know how well this solution works in around 4-6 weeks.

I am still optimistic that Ollama will gain more flexibility to assign models to GPUs manually, instead of only the automatic approach that is already implemented (which probably works quite well for most people).


@bitcandy commented on GitHub (May 12, 2025):

I think this is a very relevant issue, and it is closely related to the current memory-balancing bugs, where some models don't utilize 100% of the available memory on one GPU before starting to use other GPUs...

In my case, I'm now hitting an issue with the small model gemma3:12b-it-qat, which actually needs only half of my GPU's VRAM, but the model starts loading onto the GPU with less memory than the other and, because of the big context window, it fails with a 500 error.

My point is that half of the VRAM is still available for the context, yet it fails.


@johnquix commented on GitHub (May 21, 2025):

I am coming from #9462, which is closer to my issue, but it has been marked as a duplicate of this one.

I have a similar situation to #9462: I have two 3060 12GB GPUs, and I have Automatic1111 pinned to GPU 2 so that Open WebUI can provide image generation. This process uses 7.2GB of GPU 2’s VRAM.

When I load Gemma 3 12B on GPU 1 in Ollama, even with the maximum layers specified, Ollama will not use GPU 2—instead, it offloads to the CPU and system RAM.

If I stop Automatic1111 and load any model, Ollama splits the workload across both GPUs as expected. However, if any VRAM is in use on GPU 2, Ollama seems to avoid using it entirely, regardless of the settings.

As mentioned before, I am not interested in pinning a model to a specific GPU; rather, I want Ollama to fully utilize both GPUs, even if one is partially occupied.


@rick-github commented on GitHub (May 21, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in diagnosis.
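
Depending on how ollama is running, the logs are typically available via systemd or docker; the names below are the usual defaults and may differ on your setup:

```console
# native Linux install (systemd service)
$ journalctl -u ollama --no-pager | tail -200
# docker install (replace "ollama" with the actual container name)
$ docker logs --tail 200 ollama
```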


@Bashir-Rabbit commented on GitHub (May 22, 2025):

> I think this is a very relevant issue, and it is closely related to the current memory-balancing bugs, where some models don't utilize 100% of the available memory on one GPU before starting to use other GPUs...
>
> In my case, I'm now hitting an issue with the small model gemma3:12b-it-qat, which actually needs only half of my GPU's VRAM, but the model starts loading onto the GPU with less memory than the other and, because of the big context window, it fails with a 500 error.
>
> My point is that half of the VRAM is still available for the context, yet it fails.

So far I have had only positive experiences with Ollama; even on my WSL setup it works very stably.
The only key missing functionality is the freedom to assign GPUs to specific models (a model -> GPU-pool mapping), so I can put a small model on the low-memory GPU and still keep my main GPU's resources for daily work.


@Bashir-Rabbit commented on GitHub (May 22, 2025):

> I am coming from #9462, which is closer to my issue, but it has been marked as a duplicate of this one.
>
> I have a similar situation to #9462: I have two 3060 12GB GPUs, and I have Automatic1111 pinned to GPU 2 so that Open WebUI can provide image generation. This process uses 7.2GB of GPU 2’s VRAM.
>
> When I load Gemma 3 12B on GPU 1 in Ollama, even with the maximum layers specified, Ollama will not use GPU 2—instead, it offloads to the CPU and system RAM.
>
> If I stop Automatic1111 and load any model, Ollama splits the workload across both GPUs as expected. However, if any VRAM is in use on GPU 2, Ollama seems to avoid using it entirely, regardless of the settings.
>
> As mentioned before, I am not interested in pinning a model to a specific GPU; rather, I want Ollama to fully utilize both GPUs, even if one is partially occupied.

Hi,
I actually have the same problem:
I have a 4090 and a 3090, and from time to time I see the 3090 being used instead of the 4090 because it has marginally more free memory. I decided to use a larger model (to utilize both GPUs' memory), but that is a suboptimal solution.
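
A partial workaround for the ordering problem is to expose only the preferred card to the server via CUDA_VISIBLE_DEVICES, either by index or (more robustly) by UUID. A sketch, with a placeholder UUID:

```console
# list the GPUs and their UUIDs (output looks like "GPU 0: ... (UUID: GPU-xxxx...)")
$ nvidia-smi -L
# expose only the 4090 to this ollama instance (placeholder UUID)
$ CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
```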


@johnquix commented on GitHub (May 23, 2025):

> Server logs may aid in diagnosis.

```
time=2025-05-23T17:16:43.374Z level=INFO source=server.go:135 msg="system memory" total="31.2 GiB" free="20.9 GiB" free_swap="0 B"
time=2025-05-23T17:16:43.377Z level=INFO source=server.go:168 msg=offload library=cuda layers.requested=49 layers.model=49 layers.offload=37 layers.split=37,0 memory.available="[11.5 GiB 4.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="14.9 GiB" memory.required.partial="11.4 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[11.4 GiB 0 B]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
time=2025-05-23T17:16:43.431Z level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 24576 --batch-size 512 --n-gpu-layers 49 --threads 6 --parallel 1 --tensor-split 37,0 --port 42751"
time=2025-05-23T17:16:43.432Z level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-05-23T17:16:43.432Z level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-23T17:16:43.441Z level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-05-23T17:16:43.443Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-23T17:16:43.445Z level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:42751"
time=2025-05-23T17:16:43.501Z level=INFO source=ggml.go:73 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-05-23T17:16:43.666Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-05-23T17:16:43.695Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-23T17:16:43.763Z level=INFO source=ggml.go:299 msg="model weights" buffer=CUDA0 size="8.3 GiB"
time=2025-05-23T17:16:43.763Z level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="1.9 GiB"
time=2025-05-23T17:16:47.432Z level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.0 GiB"
time=2025-05-23T17:16:47.432Z level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B"
time=2025-05-23T17:16:47.432Z level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
time=2025-05-23T17:16:47.457Z level=INFO source=server.go:630 msg="llama runner started in 4.03 seconds"
[GIN] 2025/05/23 - 17:17:07 | 200 | 25.454387855s | 172.17.0.1 | POST "/api/chat"
[GIN] 2025/05/23 - 17:17:09 | 200 | 1.653653934s | 172.17.0.1 | POST "/api/chat"
[GIN] 2025/05/23 - 17:17:11 | 200 | 1.567774392s | 172.17.0.1 | POST "/api/chat"
[GIN] 2025/05/23 - 17:18:19 | 200 | 17.540867ms | 172.17.0.1 | GET "/api/tags"
[GIN] 2025/05/23 - 17:18:19 | 404 | 4.064µs | 10.0.7.212 | GET "/models"
[GIN] 2025/05/23 - 17:18:35 | 200 | 17.429591ms | 172.17.0.1 | GET "/api/tags"
[GIN] 2025/05/23 - 17:18:35 | 404 | 4.475µs | 10.0.7.212 | GET "/models"
[GIN] 2025/05/23 - 17:18:35 | 200 | 17.641179ms | 172.17.0.1 | GET "/api/tags"
[GIN] 2025/05/23 - 17:18:35 | 404 | 4.938µs | 10.0.7.212 | GET "/models"
[GIN] 2025/05/23 - 17:18:43 | 200 | 27.752µs | 172.17.0.1 | GET "/api/version"
[GIN] 2025/05/23 - 17:18:59 | 200 | 26.933µs | 172.17.0.1 | GET "/api/version"
time=2025-05-23T17:19:12.986Z level=INFO source=server.go:135 msg="system memory" total="31.2 GiB" free="20.9 GiB" free_swap="0 B"
time=2025-05-23T17:19:12.987Z level=INFO source=server.go:168 msg=offload library=cuda layers.requested=49 layers.model=49 layers.offload=36 layers.split=36,0 memory.available="[11.5 GiB 4.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="15.4 GiB" memory.required.partial="11.5 GiB" memory.required.kv="2.4 GiB" memory.required.allocations="[11.5 GiB 0 B]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
time=2025-05-23T17:19:13.038Z level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 32000 --batch-size 512 --n-gpu-layers 49 --threads 6 --parallel 1 --tensor-split 36,0 --port 44077"
time=2025-05-23T17:19:13.039Z level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-05-23T17:19:13.039Z level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-23T17:19:13.039Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-23T17:19:13.050Z level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-05-23T17:19:13.054Z level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:44077"
time=2025-05-23T17:19:13.100Z level=INFO source=ggml.go:73 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-05-23T17:19:13.248Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-05-23T17:19:13.290Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-23T17:19:13.344Z level=INFO source=ggml.go:299 msg="model weights" buffer=CUDA0 size="8.3 GiB"
time=2025-05-23T17:19:13.344Z level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="1.9 GiB"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1323.01 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1387274752
panic: failed to reserve graph

goroutine 11 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).loadModel(0xc0007227e0, {0x557ee72afb10?, 0xc000374140?}, {0x7ffdc0720c8c?, 0x0?}, {0xc000502f00, 0x6, 0x0, 0x31, {0xc0007091f0, ...}, ...}, ...)
github.com/ollama/ollama/runner/ollamarunner/runner.go:801 +0x2a5
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
github.com/ollama/ollama/runner/ollamarunner/runner.go:872 +0xa2b
time=2025-05-23T17:19:17.133Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-05-23T17:19:17.209Z level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
time=2025-05-23T17:19:17.384Z level=ERROR source=sched.go:478 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory\nggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1387274752"
[GIN] 2025/05/23 - 17:19:17 | 500 | 6.212018587s | 172.17.0.1 | POST "/api/chat"
time=2025-05-23T17:19:22.568Z level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.184530069 runner.size="15.4 GiB" runner.vram="11.5 GiB" runner.parallel=1 runner.pid=82 runner.model=/root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0
time=2025-05-23T17:19:22.867Z level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.482874844 runner.size="15.4 GiB" runner.vram="11.5 GiB" runner.parallel=1 runner.pid=82 runner.model=/root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0
time=2025-05-23T17:19:23.160Z level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.776333648 runner.size="15.4 GiB" runner.vram="11.5 GiB" runner.parallel=1 runner.pid=82 runner.model=/root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0
```

New development: I am finding that if I load a larger model like Gemma 3 12B Q8, it will utilize the remaining VRAM on the second GPU. The smaller Gemma 3 12B QAT model above seems to error out as you push the context above what the single GPU can handle in VRAM, and it won't spill over to the second GPU or spread some layers between them to allow for more VRAM allocation. 12B QAT works fine, but only on the first GPU at 25k context; at 32k context, the above happens.
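
For reference, the two cases can be reproduced by requesting the same model with different context sizes (default port and model tag assumed; adjust to your setup):

```console
# works: a context the first GPU can hold on its own
$ curl -s localhost:11434/api/chat -d '{"model":"gemma3:12b-it-qat","options":{"num_ctx":25600},"messages":[{"role":"user","content":"2+2?"}]}'
# fails with the OOM above: context pushed past what one GPU can hold
$ curl -s localhost:11434/api/chat -d '{"model":"gemma3:12b-it-qat","options":{"num_ctx":32000},"messages":[{"role":"user","content":"2+2?"}]}'
```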


@rick-github commented on GitHub (May 23, 2025):

```
time=2025-05-23T17:16:43.377Z level=INFO source=server.go:168 msg=offload library=cuda layers.requested=49 layers.model=49
 layers.offload=37 layers.split=37,0 memory.available="[11.5 GiB 4.6 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="14.9 GiB" memory.required.partial="11.4 GiB" memory.required.kv="2.0 GiB"
 memory.required.allocations="[11.4 GiB 0 B]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB"
 memory.weights.nonrepeating="1.9 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB"
 projector.weights="806.2 MiB" projector.graph="1.0 GiB"
```

There's a minimum amount of memory required on a GPU before ollama can load layers onto it. That minimum is the amount needed to hold the projector data, the memory graph, at least two layers, a safety buffer, and some extra incidental allocations. So the minimum is 1G + 806M + 1.3G + 1G + 457M + incidental = ~4.7G (approximate, as I don't have the exact figure for the layer size handy). So the small device falls just short of being able to host some layers. You can reduce num_ctx or num_batch to lower this minimum, or enable flash attention (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-enable-flash-attention) or k/v cache quantization (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-set-the-quantization-type-for-the-kv-cache) to make cache usage more space efficient.
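
As a sketch of what those settings look like in practice (recent Ollama versions, default port assumed):

```console
# server-side: enable flash attention and quantize the k/v cache
$ OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# per-request: reduce the context so the cache fits in the available VRAM
$ curl -s localhost:11434/api/chat -d '{"model":"gemma3:12b-it-qat","options":{"num_ctx":16384},"messages":[{"role":"user","content":"2+2?"}]}'
```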

```
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1323.01 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1387274752
panic: failed to reserve graph
```

Now the runner is allocating memory on the usable device. We see from earlier that it estimated using 11.4G of the 11.5G available. Since the device OOM'ed, the estimate was too tight. See https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288 for ways of dealing with OOM situations.
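
A few of those options, sketched as commands (the values are examples, not tuned; OLLAMA_SCHED_SPREAD may or may not help here given the per-GPU minimum discussed above):

```console
# reserve some per-GPU headroom (in bytes) so the estimate is less tight
$ OLLAMA_GPU_OVERHEAD=536870912 ollama serve
# force the scheduler to spread the model across all visible GPUs
$ OLLAMA_SCHED_SPREAD=1 ollama serve
# or cap the number of offloaded layers for a single request
$ curl -s localhost:11434/api/chat -d '{"model":"gemma3:12b-it-qat","options":{"num_gpu":34},"messages":[{"role":"user","content":"2+2?"}]}'
```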

> I am finding if I load a larger model like Gemma3 12B Q8 it will utilize the remaining VRAM on the second GPU

Different models will compute the size of the memory graph differently - besides the impact of num_ctx and num_batch, the number of attention heads, the size of the vocab, and the size of the embedding affect this value.
