[GH-ISSUE #10125] qwen2.5-coder-cline:14b not using NVIDIA GPU #53156

Closed
opened 2026-04-29 02:08:39 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @SvenMeyer on GitHub (Apr 4, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10125

What is the issue?

qwen2.5-coder:32b-instruct-q8_0 offloads as many layers as possible to the 8 GB GPU.

qwen2.5-coder-cline:7b offloads some layers to the GPU, but there should be space for more.

qwen2.5-coder-cline:14b (and the 32b model) do not use the GPU at all; all layers go into CPU RAM.

Did something go wrong when the cline derivative of the model was created?

I even created a new Modelfile that sets mmap to true, but that did not change anything.

Relevant log output


$ ollama run --verbose maryasov/qwen2.5-coder:32b-instruct-q8_0

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 11 repeating layers to GPU
load_tensors: offloaded 11/65 layers to GPU
load_tensors:        CUDA0 model buffer size =  5435.42 MiB
load_tensors:   CPU_Mapped model buffer size = 33202.08 MiB


$ ollama run --verbose maryasov/qwen2.5-coder-cline:7b

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 8 repeating layers to GPU
load_tensors: offloaded 8/29 layers to GPU
load_tensors:        CUDA0 model buffer size =  1103.35 MiB
load_tensors:   CPU_Mapped model buffer size =  4460.45 MiB


$ ollama run --verbose maryasov/qwen2.5-coder-cline:14b

load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  8566.04 MiB
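
(Editor's aside, not part of the original report: once a model is loaded, ollama ps reports how it was split between CPU and GPU, which makes the symptom quick to confirm. The output below is a sketch; the ID, size, and expiry shown are illustrative.)

$ ollama ps
NAME                                ID              SIZE      PROCESSOR    UNTIL
maryasov/qwen2.5-coder-cline:14b    0123456789ab    9.9 GB    100% CPU     4 minutes from now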

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.6.2

GiteaMirror added the bug and needs more info labels 2026-04-29 02:08:40 -05:00
Author
Owner

@rick-github commented on GitHub (Apr 7, 2025):

Modelfile?
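
(Editor's note: the Modelfile of an installed model can be printed with ollama show; the tag below is the one from this report.)

$ ollama show --modelfile maryasov/qwen2.5-coder-cline:32b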

Author
Owner

@SvenMeyer commented on GitHub (Apr 8, 2025):

@rick-github As the 32b version showed the same problem, I'm pasting that one here. It was the last one I tried, unmodified, as downloaded from ollama (only reformatted here for readability).

https://ollama.com/maryasov/qwen2.5-coder-cline

{
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "config": {
        "mediaType": "application/vnd.docker.container.image.v1+json",
        "digest": "sha256:5df93646d8858926437da52dca2ab9c23abd0391145aa087d4ae3b20cb50ecb1",
        "size": 488
    },
    "layers": [
        {
            "mediaType": "application/vnd.ollama.image.model",
            "digest": "sha256:ac3d1ba8aa77755dab3806d9024e9c385ea0d5b412d6bdf9157f8a4a7e9fc0d9",
            "size": 19851336384,
            "from": "/usr/share/ollama/.ollama/models/blobs/sha256-ac3d1ba8aa77755dab3806d9024e9c385ea0d5b412d6bdf9157f8a4a7e9fc0d9"
        },
        {
            "mediaType": "application/vnd.ollama.image.template",
            "digest": "sha256:3852d4fe7f87a6e29157b211ff0b037d2c0624023b6c849599e9b4e9148316f8",
            "size": 1899
        },
        {
            "mediaType": "application/vnd.ollama.image.license",
            "digest": "sha256:832dd9e00a68dd83b3c3fb9f5588dad7dcf337a0db50f7d9483f310cd292e92e",
            "size": 11343
        },
        {
            "mediaType": "application/vnd.ollama.image.params",
            "digest": "sha256:8b2542ec39db97e0dad14ab4fe6dd6301e127390ca064522c9ce752a26f1526e",
            "size": 119
        }
    ]
}

Then I tried to use it with this Modelfile:

FROM maryasov/qwen2.5-coder-cline:32b
PARAMETER use_mmap true

... which did not change anything.
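
(Editor's sketch, not tried in this thread: Ollama's num_gpu parameter requests a specific number of offloaded layers, which can help distinguish a scheduler decision from a genuine out-of-memory condition. The layer count below is illustrative; tune it to the available VRAM.)

FROM maryasov/qwen2.5-coder-cline:32b
# Request ~10 layers on the GPU (illustrative value).
PARAMETER num_gpu 10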

Author
Owner

@SvenMeyer commented on GitHub (Apr 8, 2025):

Same result with version 0.6.5.

Author
Owner

@rick-github commented on GitHub (Apr 8, 2025):

From the params blob:

{
    "num_ctx": 32768,
    "stop": [
        "<|im_start|>",
        "<|im_end|>",
        "<|endoftext|>"
    ],
    "temperature": 0
}

A 32K context window requires 8G, which may be too large to fit alongside the model weights and computation graph on your GPU. Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may show details.
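
(Editor's note: the 8G figure is consistent with Qwen2.5-32B's geometry, assuming 64 transformer layers, 8 KV heads of dimension 128, and an fp16 cache: 2 (K and V) × 64 × 32768 × 8 × 128 × 2 bytes ≈ 8 GiB. A minimal workaround sketch is to shrink the context window so the cache fits alongside the offloaded weights; 8192 is an illustrative value.)

FROM maryasov/qwen2.5-coder-cline:32b
# An 8K context needs ~2 GiB of KV cache under the same assumptions.
PARAMETER num_ctx 8192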
