[GH-ISSUE #8430] Multi GPU, default GPU setting, specific model pin to specific GPU #67474

Open
opened 2026-05-04 10:29:14 -05:00 by GiteaMirror · 11 comments

Originally created by @Bashir-Rabbit on GitHub (Jan 15, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8430

I have a multi-GPU configuration with different GPU models and different memory sizes.
I wish:

  1. we could select a default GPU for all models (ideally the fastest one with the most memory)
  2. we could pin specific models to specific GPUs, so that small models go to the small-VRAM GPU and the fastest/highest-VRAM card is kept free for larger models (see the illustrative sketch below).
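
Purely as an illustration of the kind of control I mean (neither of these settings exists in Ollama today; they are hypothetical):

```console
# hypothetical settings, shown only to illustrate the request
$ OLLAMA_DEFAULT_GPU=0 ollama serve   # wish 1: a default GPU for all models
$ ollama run llama3.2 --gpu 1         # wish 2: pin a specific model to a specific GPU
```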
GiteaMirror added the feature request label 2026-05-04 10:29:14 -05:00

@rick-github commented on GitHub (Jan 15, 2025):

Related: https://github.com/ollama/ollama/issues/3902


@Bashir-Rabbit commented on GitHub (Jan 15, 2025):

Thanks,
I am going to buy a 5090 and I would like much more control over my GPUs so I can utilize them more efficiently.
Ollama and the whole deployment concept are amazing, and more and more people will use it for private AI agents instead of subscription-based AI solutions that carry the risk of data being leaked or misused.


@rick-github commented on GitHub (Jan 16, 2025):

It will probably be a while before #3902 sees any progress. In the meantime, you can sort of do this by running multiple ollama servers, binding a specific GPU to each server with CUDA_VISIBLE_DEVICES, and then running a proxy in front to distribute the queries. litellm (https://github.com/BerriAI/litellm) is an LLM proxy that can route queries based on the model name. If Docker is available, it's pretty easy to do a proof of concept. Here I've assumed 2 GPUs: GPU0 has lots of VRAM and will run big-model (actually qwen2.5:0.5b in ollama), and GPU1 is set as the default for all other models.

```yaml
x-ollama: &ollama
  image: ollama/ollama:${OLLAMA_DOCKER_TAG-latest}
  volumes:
    - ${OLLAMA_MODELS-./ollama}:/root/.ollama
  environment: &env
    OLLAMA_KEEP_ALIVE: ${OLLAMA_KEEP_ALIVE--1}
    OLLAMA_LOAD_TIMEOUT: ${OLLAMA_LOAD_TIMEOUT-5m}
    OLLAMA_MAX_LOADED_MODELS: 1
    OLLAMA_NUM_PARALLEL: 1
    OLLAMA_TMPDIR: /tmp/ollama
    OLLAMA_FLASH_ATTENTION: ${OLLAMA_FLASH_ATTENTION-0}
    OLLAMA_DEBUG: ${OLLAMA_DEBUG-1}
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]

services:
  ollama-1:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 0

  ollama-2:
    <<: *ollama
    environment:
      <<: *env
      CUDA_VISIBLE_DEVICES: 1
      OLLAMA_MAX_LOADED_MODELS: 4

  litellm:
    image: litellm-lb
    build:
      dockerfile_inline: |
        FROM litellm/litellm:${LITELLM_DOCKER_TAG-v1.58.2}
        RUN cat > /config.yaml <<EOF
        model_list:
          - model_name: big-model
            litellm_params:
              model: ollama/qwen2.5:0.5b
              api_base: http://ollama-1:11434
          - model_name: "*"
            litellm_params:
              model: "ollama/*"
              api_base: http://ollama-2:11434
        litellm_settings:
          drop_params: True
        EOF
        CMD [ "--config", "/config.yaml", "--port", "4040" ]
    ports:
      - 4040:4040
    depends_on:
      - ollama-1
      - ollama-2
```
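
Assuming Docker Compose v2, bringing the stack up and pulling the models looks something like this (both backends mount the same model store in this compose file, so a pull from either is visible to both):

```console
$ docker compose up -d
$ docker compose exec ollama-1 ollama pull qwen2.5:0.5b
$ docker compose exec ollama-2 ollama pull llama3.2
```

Then query the proxy: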
```console
$ curl -s localhost:4040/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"big-model","messages":[{"role":"user","content":"2+2?"}]}' | jq
{
  "id": "chatcmpl-7860f649-b125-4733-8722-ebdaaf70d968",
  "created": 1737047060,
  "model": "ollama/qwen2.5:0.5b",
  "object": "chat.completion",
  "system_fingerprint": null,
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "2 + 2 equals 4.\n\nThis is a simple arithmetic operation that involves two numbers (2 and 2), where we add them together to get the sum.",
        "role": "assistant",
        "tool_calls": null,
        "function_call": null
      }
    }
  ],
  "usage": {
    "completion_tokens": 35,
    "prompt_tokens": 36,
    "total_tokens": 71,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  }
}
$ curl -s localhost:4040/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama3.2","messages":[{"role":"user","content":"2+2?"}]}' | jq
{
  "id": "chatcmpl-14a252dd-564e-4b1a-9cc3-1e6269de28fc",
  "created": 1737047497,
  "model": "ollama/llama3.2",
  "object": "chat.completion",
  "system_fingerprint": null,
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The answer is: 4",
        "role": "assistant",
        "tool_calls": null,
        "function_call": null
      }
    }
  ],
  "usage": {
    "completion_tokens": 7,
    "prompt_tokens": 32,
    "total_tokens": 39,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  }
}
```

The restriction is that you have to use the OpenAI API to talk to the proxy, so some of the fancier ollama operations, like creating models, are not available.
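
If the native ollama API is needed, the same idea also works without Docker or the proxy: run two ollama servers pinned to different GPUs on different ports and point clients at whichever instance should serve a given model. A minimal sketch, assuming an NVIDIA setup and a recent Ollama:

```console
# terminal 1: big-VRAM GPU on the default port
$ CUDA_VISIBLE_DEVICES=0 ollama serve
# terminal 2: second GPU on an alternate port
$ CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve
# clients select an instance via OLLAMA_HOST
$ OLLAMA_HOST=127.0.0.1:11435 ollama run llama3.2
```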


@Bashir-Rabbit commented on GitHub (Jan 16, 2025):

Thanks. Running multiple docker containers and using a proxy is exactly what I wanted to avoid.
If this is the only way, it is still a solution, and I will test it as soon as I get the 5090.

Again, many thanks for the help; I will let you know how well this solution works in around 4-6 weeks.

I am still optimistic that Ollama will gain more flexibility to assign models to GPUs manually, instead of only the automatic approach that is already implemented (which probably works quite well for most people).


@bitcandy commented on GitHub (May 12, 2025):

I think this is a very relevant issue, and it is closely related to the current memory-balancing bugs, where some models don't utilize 100% of the available memory on one GPU before starting to use other GPUs...

In my case, I'm now hitting an issue with the small model gemma3:12b-it-qat, which actually needs only half of my GPU's VRAM, but the model starts loading onto the GPU with less memory than the other and, because of the big context window, it fails with a 500 error.

My point is that half of the VRAM is still available for the context, yet it fails.


@johnquix commented on GitHub (May 21, 2025):

I am coming from #9462, which is closer to my issue, but it has been marked as a duplicate of this one.

I have a similar situation to #9462: I have two 3060 12GB GPUs, and I have Automatic1111 pinned to GPU 2 so that Open WebUI can provide image generation. This process uses 7.2GB of GPU 2’s VRAM.

When I load Gemma 3 12B on GPU 1 in Ollama, even with the maximum layers specified, Ollama will not use GPU 2—instead, it offloads to the CPU and system RAM.

If I stop Automatic1111 and load any model, Ollama splits the workload across both GPUs as expected. However, if any VRAM is in use on GPU 2, Ollama seems to avoid using it entirely, regardless of the settings.

As mentioned before, I am not interested in pinning a model to a specific GPU; rather, I want Ollama to fully utilize both GPUs, even if one is partially occupied.


@rick-github commented on GitHub (May 21, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in diagnosis.
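
Depending on how ollama is running, the logs are typically available via systemd or docker; the names below are the usual defaults and may differ on your setup:

```console
# native Linux install (systemd service)
$ journalctl -u ollama --no-pager | tail -200
# docker install (replace "ollama" with the actual container name)
$ docker logs --tail 200 ollama
```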


@Bashir-Rabbit commented on GitHub (May 22, 2025):

> I think this is a very relevant issue, and it is closely related to the current memory-balancing bugs, where some models don't utilize 100% of the available memory on one GPU before starting to use other GPUs...
>
> In my case, I'm now hitting an issue with the small model gemma3:12b-it-qat, which actually needs only half of my GPU's VRAM, but the model starts loading onto the GPU with less memory than the other and, because of the big context window, it fails with a 500 error.
>
> My point is that half of the VRAM is still available for the context, yet it fails.

So far I have had only positive experiences with Ollama; even on my WSL setup it works very stably.
The only key missing functionality is the freedom to assign GPUs to specific models (a model -> GPU-pool mapping), so I can put a small model on the low-memory GPU and still keep my main GPU's resources for daily work.


@Bashir-Rabbit commented on GitHub (May 22, 2025):

> I am coming from #9462, which is closer to my issue, but it has been marked as a duplicate of this one.
>
> I have a similar situation to #9462: I have two 3060 12GB GPUs, and I have Automatic1111 pinned to GPU 2 so that Open WebUI can provide image generation. This process uses 7.2GB of GPU 2’s VRAM.
>
> When I load Gemma 3 12B on GPU 1 in Ollama, even with the maximum layers specified, Ollama will not use GPU 2—instead, it offloads to the CPU and system RAM.
>
> If I stop Automatic1111 and load any model, Ollama splits the workload across both GPUs as expected. However, if any VRAM is in use on GPU 2, Ollama seems to avoid using it entirely, regardless of the settings.
>
> As mentioned before, I am not interested in pinning a model to a specific GPU; rather, I want Ollama to fully utilize both GPUs, even if one is partially occupied.

Hi,
I actually have the same problem:
I have a 4090 and a 3090, and from time to time I see the 3090 being used instead of the 4090 because it has marginally more free memory. I decided to use a larger model (to utilize both GPUs' memory), but that is a suboptimal solution.
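
A partial workaround for the ordering problem is to expose only the preferred card to the server via CUDA_VISIBLE_DEVICES, either by index or (more robustly) by UUID. A sketch, with a placeholder UUID:

```console
# list the GPUs and their UUIDs (output looks like "GPU 0: ... (UUID: GPU-xxxx...)")
$ nvidia-smi -L
# expose only the 4090 to this ollama instance (placeholder UUID)
$ CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
```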


@johnquix commented on GitHub (May 23, 2025):

> Server logs may aid in diagnosis.

```
time=2025-05-23T17:16:43.374Z level=INFO source=server.go:135 msg="system memory" total="31.2 GiB" free="20.9 GiB" free_swap="0 B"
time=2025-05-23T17:16:43.377Z level=INFO source=server.go:168 msg=offload library=cuda layers.requested=49 layers.model=49 layers.offload=37 layers.split=37,0 memory.available="[11.5 GiB 4.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="14.9 GiB" memory.required.partial="11.4 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[11.4 GiB 0 B]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
time=2025-05-23T17:16:43.431Z level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 24576 --batch-size 512 --n-gpu-layers 49 --threads 6 --parallel 1 --tensor-split 37,0 --port 42751"
time=2025-05-23T17:16:43.432Z level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-05-23T17:16:43.432Z level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-23T17:16:43.441Z level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-05-23T17:16:43.443Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-23T17:16:43.445Z level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:42751"
time=2025-05-23T17:16:43.501Z level=INFO source=ggml.go:73 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-05-23T17:16:43.666Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-05-23T17:16:43.695Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-23T17:16:43.763Z level=INFO source=ggml.go:299 msg="model weights" buffer=CUDA0 size="8.3 GiB"
time=2025-05-23T17:16:43.763Z level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="1.9 GiB"
time=2025-05-23T17:16:47.432Z level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.0 GiB"
time=2025-05-23T17:16:47.432Z level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B"
time=2025-05-23T17:16:47.432Z level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
time=2025-05-23T17:16:47.457Z level=INFO source=server.go:630 msg="llama runner started in 4.03 seconds"
[GIN] 2025/05/23 - 17:17:07 | 200 | 25.454387855s | 172.17.0.1 | POST "/api/chat"
[GIN] 2025/05/23 - 17:17:09 | 200 | 1.653653934s | 172.17.0.1 | POST "/api/chat"
[GIN] 2025/05/23 - 17:17:11 | 200 | 1.567774392s | 172.17.0.1 | POST "/api/chat"
[GIN] 2025/05/23 - 17:18:19 | 200 | 17.540867ms | 172.17.0.1 | GET "/api/tags"
[GIN] 2025/05/23 - 17:18:19 | 404 | 4.064µs | 10.0.7.212 | GET "/models"
[GIN] 2025/05/23 - 17:18:35 | 200 | 17.429591ms | 172.17.0.1 | GET "/api/tags"
[GIN] 2025/05/23 - 17:18:35 | 404 | 4.475µs | 10.0.7.212 | GET "/models"
[GIN] 2025/05/23 - 17:18:35 | 200 | 17.641179ms | 172.17.0.1 | GET "/api/tags"
[GIN] 2025/05/23 - 17:18:35 | 404 | 4.938µs | 10.0.7.212 | GET "/models"
[GIN] 2025/05/23 - 17:18:43 | 200 | 27.752µs | 172.17.0.1 | GET "/api/version"
[GIN] 2025/05/23 - 17:18:59 | 200 | 26.933µs | 172.17.0.1 | GET "/api/version"
time=2025-05-23T17:19:12.986Z level=INFO source=server.go:135 msg="system memory" total="31.2 GiB" free="20.9 GiB" free_swap="0 B"
time=2025-05-23T17:19:12.987Z level=INFO source=server.go:168 msg=offload library=cuda layers.requested=49 layers.model=49 layers.offload=36 layers.split=36,0 memory.available="[11.5 GiB 4.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="15.4 GiB" memory.required.partial="11.5 GiB" memory.required.kv="2.4 GiB" memory.required.allocations="[11.5 GiB 0 B]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
time=2025-05-23T17:19:13.038Z level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 32000 --batch-size 512 --n-gpu-layers 49 --threads 6 --parallel 1 --tensor-split 36,0 --port 44077"
time=2025-05-23T17:19:13.039Z level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-05-23T17:19:13.039Z level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-23T17:19:13.039Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-23T17:19:13.050Z level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-05-23T17:19:13.054Z level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:44077"
time=2025-05-23T17:19:13.100Z level=INFO source=ggml.go:73 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-05-23T17:19:13.248Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-05-23T17:19:13.290Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-23T17:19:13.344Z level=INFO source=ggml.go:299 msg="model weights" buffer=CUDA0 size="8.3 GiB"
time=2025-05-23T17:19:13.344Z level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="1.9 GiB"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1323.01 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1387274752
panic: failed to reserve graph

goroutine 11 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).loadModel(0xc0007227e0, {0x557ee72afb10?, 0xc000374140?}, {0x7ffdc0720c8c?, 0x0?}, {0xc000502f00, 0x6, 0x0, 0x31, {0xc0007091f0, ...}, ...}, ...)
github.com/ollama/ollama/runner/ollamarunner/runner.go:801 +0x2a5
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
github.com/ollama/ollama/runner/ollamarunner/runner.go:872 +0xa2b
time=2025-05-23T17:19:17.133Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-05-23T17:19:17.209Z level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
time=2025-05-23T17:19:17.384Z level=ERROR source=sched.go:478 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory\nggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1387274752"
[GIN] 2025/05/23 - 17:19:17 | 500 | 6.212018587s | 172.17.0.1 | POST "/api/chat"
time=2025-05-23T17:19:22.568Z level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.184530069 runner.size="15.4 GiB" runner.vram="11.5 GiB" runner.parallel=1 runner.pid=82 runner.model=/root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0
time=2025-05-23T17:19:22.867Z level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.482874844 runner.size="15.4 GiB" runner.vram="11.5 GiB" runner.parallel=1 runner.pid=82 runner.model=/root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0
time=2025-05-23T17:19:23.160Z level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.776333648 runner.size="15.4 GiB" runner.vram="11.5 GiB" runner.parallel=1 runner.pid=82 runner.model=/root/.ollama/models/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0
```

New development: I am finding that if I load a larger model like Gemma 3 12B Q8, it will utilize the remaining VRAM on the second GPU. The smaller Gemma 3 12B QAT model above seems to error out as you push the context above what the single GPU can handle in VRAM, and it won't spill over to the second GPU or spread some layers between them to allow for more VRAM allocation. 12B QAT works fine, but only on the first GPU at 25k context; at 32k context, the above happens.
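
For reference, the two cases can be reproduced by requesting the same model with different context sizes (default port and model tag assumed; adjust to your setup):

```console
# works: a context the first GPU can hold on its own
$ curl -s localhost:11434/api/chat -d '{"model":"gemma3:12b-it-qat","options":{"num_ctx":25600},"messages":[{"role":"user","content":"2+2?"}]}'
# fails with the OOM above: context pushed past what one GPU can hold
$ curl -s localhost:11434/api/chat -d '{"model":"gemma3:12b-it-qat","options":{"num_ctx":32000},"messages":[{"role":"user","content":"2+2?"}]}'
```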


@rick-github commented on GitHub (May 23, 2025):

```
time=2025-05-23T17:16:43.377Z level=INFO source=server.go:168 msg=offload library=cuda layers.requested=49 layers.model=49
 layers.offload=37 layers.split=37,0 memory.available="[11.5 GiB 4.6 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="14.9 GiB" memory.required.partial="11.4 GiB" memory.required.kv="2.0 GiB"
 memory.required.allocations="[11.4 GiB 0 B]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB"
 memory.weights.nonrepeating="1.9 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB"
 projector.weights="806.2 MiB" projector.graph="1.0 GiB"
```

There's a minimum amount of memory required on a GPU before ollama can load layers onto it. That minimum is the amount needed to hold the projector data, the memory graph, at least two layers, a safety buffer, and some extra incidental allocations. So the minimum is 1G + 806M + 1.3G + 1G + 457M + incidental = ~4.7G (approximate, as I don't have the exact figure for the layer size handy). So the small device falls just short of being able to host some layers. You can reduce num_ctx or num_batch to lower this minimum, or enable flash attention (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-enable-flash-attention) or k/v cache quantization (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-set-the-quantization-type-for-the-kv-cache) to make cache usage more space efficient.
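
As a sketch of what those settings look like in practice (recent Ollama versions, default port assumed):

```console
# server-side: enable flash attention and quantize the k/v cache
$ OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# per-request: reduce the context so the cache fits in the available VRAM
$ curl -s localhost:11434/api/chat -d '{"model":"gemma3:12b-it-qat","options":{"num_ctx":16384},"messages":[{"role":"user","content":"2+2?"}]}'
```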

```
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1323.01 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1387274752
panic: failed to reserve graph
```

Now the runner is allocating memory on the usable device. We see from earlier that it estimated using 11.4G of the 11.5G available. Since the device OOM'ed, the estimate was too tight. See https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288 for ways of dealing with OOM situations.
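
A few of those options, sketched as commands (the values are examples, not tuned; OLLAMA_SCHED_SPREAD may or may not help here given the per-GPU minimum discussed above):

```console
# reserve some per-GPU headroom (in bytes) so the estimate is less tight
$ OLLAMA_GPU_OVERHEAD=536870912 ollama serve
# force the scheduler to spread the model across all visible GPUs
$ OLLAMA_SCHED_SPREAD=1 ollama serve
# or cap the number of offloaded layers for a single request
$ curl -s localhost:11434/api/chat -d '{"model":"gemma3:12b-it-qat","options":{"num_gpu":34},"messages":[{"role":"user","content":"2+2?"}]}'
```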

> I am finding if I load a larger model like Gemma3 12B Q8 it will utilize the remaining VRAM on the second GPU

Different models will compute the size of the memory graph differently - besides the impact of num_ctx and num_batch, the number of attention heads, the size of the vocab, and the size of the embedding affect this value.
