[GH-ISSUE #14045] glm-4.7-flash is slow and uses a lot of cpu #71237

Open
opened 2026-05-05 00:50:53 -05:00 by GiteaMirror · 12 comments

Originally created by @inforithmics on GitHub (Feb 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14045

Originally assigned to: @jmorganca on GitHub.

What is the issue?

I downloaded the 0.15.5 rc and ran it on a 7900 XTX, and the model was very slow and used a lot of CPU. Memory was allocated on the GPU. I reverted back to 0.15.4 and it was fast again.

Relevant log output


OS

Windows

GPU

AMD

CPU

Intel

Ollama version

0.15.5 rc

GiteaMirror added the bug label 2026-05-05 00:50:53 -05:00

@rick-github commented on GitHub (Feb 3, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) may aid in debugging.
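(For anyone hitting the same thing, a minimal sketch of capturing debug-level server logs, assuming the documented OLLAMA_DEBUG variable and the default Windows log location:)

```shell
# Restart the server with debug logging enabled (on Windows: set OLLAMA_DEBUG=1 before
# starting the app; the server log is normally %LOCALAPPDATA%\Ollama\server.log).
OLLAMA_DEBUG=1 ollama serve

# In a second terminal, reproduce the slow run so the load/offload decisions get logged:
ollama run glm-4.7-flash --verbose "why is the sky blue?"
```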


@rick-github commented on GitHub (Feb 3, 2026):

```
glm-4.7-flash   0.15.3       prompt eval rate: 5.32 tokens/s eval rate: 131.31 tokens/s
glm-4.7-flash   0.15.4       prompt eval rate: 5.33 tokens/s eval rate: 132.45 tokens/s
glm-4.7-flash   0.15.5-rc0   prompt eval rate: 5.34 tokens/s eval rate: 131.33 tokens/s
glm-4.7-flash   0.15.5-rc1   prompt eval rate: 5.30 tokens/s eval rate: 21.11 tokens/s
qwen3           0.15.3       prompt eval rate: 8.20 tokens/s eval rate: 189.55 tokens/s
qwen3           0.15.4       prompt eval rate: 7.44 tokens/s eval rate: 192.03 tokens/s
qwen3           0.15.5-rc0   prompt eval rate: 7.72 tokens/s eval rate: 192.54 tokens/s
qwen3           0.15.5-rc1   prompt eval rate: 7.54 tokens/s eval rate: 163.58 tokens/s
```

Between rc0 and rc1 is the vendor sync (#13832), so that seems the likely cause.
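(For context, rates like these come from `ollama run --verbose`, which prints the prompt eval and eval rates after each response; a rough sketch of the loop, repeated per installed version:)

```shell
# Assumes the models are already pulled; --verbose prints timing after the response.
for model in glm-4.7-flash qwen3; do
  echo "== $model =="
  ollama run "$model" --verbose "why is the sky blue?" 2>&1 | grep "eval rate"
done
```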


@inforithmics commented on GitHub (Feb 3, 2026):

When I checked out the code at the vendor sync, it was still fast. But I saw that glm-4.7 was implemented in the experimental engine; maybe this influences it somehow. I'll try to investigate further.


@rick-github commented on GitHub (Feb 3, 2026):

Some other models are also affected to a lesser degree, e.g. qwen3 and lfm2.5-thinking, but not glm-4.7 or minimax-m2.1.

```
glm-4.7-flash        0.15.4       prompt eval rate: 5.40 tokens/s eval rate: 131.65 tokens/s
glm-4.7-flash        0.15.5-rc0   prompt eval rate: 5.33 tokens/s eval rate: 130.27 tokens/s
glm-4.7-flash        0.15.5-rc1   prompt eval rate: 5.27 tokens/s eval rate: 21.79 tokens/s
qwen3                0.15.4       prompt eval rate: 8.10 tokens/s eval rate: 192.90 tokens/s
qwen3                0.15.5-rc0   prompt eval rate: 7.54 tokens/s eval rate: 192.73 tokens/s
qwen3                0.15.5-rc1   prompt eval rate: 7.77 tokens/s eval rate: 166.06 tokens/s
lfm2.5-thinking:1.2b 0.15.4       prompt eval rate: 189.03 tokens/s eval rate: 783.76 tokens/s
lfm2.5-thinking:1.2b 0.15.5-rc0   prompt eval rate: 4117.98 tokens/s eval rate: 782.86 tokens/s
lfm2.5-thinking:1.2b 0.15.5-rc1   prompt eval rate: 4027.82 tokens/s eval rate: 574.14 tokens/s
frob/glm-4.7         0.15.4       prompt eval rate: 6.34 tokens/s eval rate: 49.45 tokens/s
frob/glm-4.7         0.15.5-rc0   prompt eval rate: 6.66 tokens/s eval rate: 49.59 tokens/s
frob/glm-4.7         0.15.5-rc1   prompt eval rate: 6.26 tokens/s eval rate: 49.76 tokens/s
frob/minimax-m2.1    0.15.4       prompt eval rate: 160.88 tokens/s eval rate: 119.91 tokens/s
frob/minimax-m2.1    0.15.5-rc0   prompt eval rate: 170.86 tokens/s eval rate: 119.78 tokens/s
frob/minimax-m2.1    0.15.5-rc1   prompt eval rate: 157.15 tokens/s eval rate: 121.02 tokens/s
```

@jessegross commented on GitHub (Feb 3, 2026):

Confirmed that it is the GGML bump:

```
ef00199fb4e6d045e11e76baaab9049f3234939d is the first bad commit
commit ef00199fb4e6d045e11e76baaab9049f3234939d
Author: Jeffrey Morgan <jmorganca@gmail.com>
Date:   Mon Feb 2 17:31:59 2026 -0800

    Update vendor ggml code to a5bb8ba4 (#13832)

    Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
    Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
    Co-authored-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
```
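(The bisect itself is the standard git workflow; a sketch, assuming the rc0/rc1 tags mark the known-good and known-bad builds, with the rebuild and benchmark step depending on the local setup:)

```shell
# Illustrative only: tag names and the test step are assumptions, not the exact commands used.
git bisect start
git bisect bad  v0.15.5-rc1   # slow eval rate
git bisect good v0.15.5-rc0   # normal eval rate
# At each bisect step: rebuild, benchmark the model, then mark the commit accordingly.
ollama run glm-4.7-flash --verbose "test prompt"
git bisect good   # or: git bisect bad
```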

@inforithmics commented on GitHub (Feb 3, 2026):

Could this help?
https://github.com/ollama/ollama/pull/13597
From how it looks, something is running on the CPU, and normally that means a ggml operation isn't supported.


@inforithmics commented on GitHub (Feb 3, 2026):

Or maybe this caused it:
https://github.com/ggml-org/llama.cpp/pull/18986


@rick-github commented on GitHub (Feb 3, 2026):

Doesn't appear to affect ROCm devices, above was Nvidia.

```
glm-4.7-flash        0.15.4-rocm       prompt eval rate: 93.72 tokens/s eval rate: 36.22 tokens/s
glm-4.7-flash        0.15.5-rc0-rocm   prompt eval rate: 82.84 tokens/s eval rate: 35.72 tokens/s
glm-4.7-flash        0.15.5-rc1-rocm   prompt eval rate: 94.07 tokens/s eval rate: 35.67 tokens/s
qwen3                0.15.4-rocm       prompt eval rate: 179.80 tokens/s eval rate: 31.69 tokens/s
qwen3                0.15.5-rc0-rocm   prompt eval rate: 172.61 tokens/s eval rate: 32.02 tokens/s
qwen3                0.15.5-rc1-rocm   prompt eval rate: 177.51 tokens/s eval rate: 32.14 tokens/s
lfm2.5-thinking:1.2b 0.15.4-rocm       prompt eval rate: 269.60 tokens/s eval rate: 170.19 tokens/s
lfm2.5-thinking:1.2b 0.15.5-rc0-rocm   prompt eval rate: 278.23 tokens/s eval rate: 168.17 tokens/s
lfm2.5-thinking:1.2b 0.15.5-rc1-rocm   prompt eval rate: 285.79 tokens/s eval rate: 166.60 tokens/s
```


@jmorganca commented on GitHub (Feb 3, 2026):

Thanks @rick-github for the testing. Working on this!


@inforithmics commented on GitHub (Feb 4, 2026):

> Doesn't appear to affect ROCm devices, above was Nvidia.
>
> ```
> glm-4.7-flash        0.15.4-rocm       prompt eval rate: 93.72 tokens/s eval rate: 36.22 tokens/s
> glm-4.7-flash        0.15.5-rc0-rocm   prompt eval rate: 82.84 tokens/s eval rate: 35.72 tokens/s
> glm-4.7-flash        0.15.5-rc1-rocm   prompt eval rate: 94.07 tokens/s eval rate: 35.67 tokens/s
> qwen3                0.15.4-rocm       prompt eval rate: 179.80 tokens/s eval rate: 31.69 tokens/s
> qwen3                0.15.5-rc0-rocm   prompt eval rate: 172.61 tokens/s eval rate: 32.02 tokens/s
> qwen3                0.15.5-rc1-rocm   prompt eval rate: 177.51 tokens/s eval rate: 32.14 tokens/s
> lfm2.5-thinking:1.2b 0.15.4-rocm       prompt eval rate: 269.60 tokens/s eval rate: 170.19 tokens/s
> lfm2.5-thinking:1.2b 0.15.5-rc0-rocm   prompt eval rate: 278.23 tokens/s eval rate: 168.17 tokens/s
> lfm2.5-thinking:1.2b 0.15.5-rc1-rocm   prompt eval rate: 285.79 tokens/s eval rate: 166.60 tokens/s
> ```

Interesting. How much VRAM does this ROCm device have? On a ROCm device with 24 GB it showed this behavior too: all layers were allocated on the GPU, but some of the graph memory was allocated on the CPU and more on the GPU.
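(A quick way to see the CPU/GPU split of a loaded model, independent of the server log, is `ollama ps`; the PROCESSOR column shows e.g. "100% GPU" or a CPU/GPU percentage split:)

```shell
# Load the model, then inspect where it ended up.
ollama run glm-4.7-flash "hello" > /dev/null
ollama ps
```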


@inforithmics commented on GitHub (Feb 4, 2026):

I compared the log outputs of Ollama 0.15.4 ROCm and 0.15.5-rc1 ROCm (Vulkan is enabled) and found the following interesting things.

0.15.4:

```
ggml_backend_vk_get_device_memory called: uuid 00000000-0300-0000-0000-000000000000time=2026-02-04T08:49:53.983+01:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="175.6 MiB"
time=2026-02-04T08:49:53.983+01:00 level=INFO source=device.go:272 msg="total memory" size="21.4 GiB"
```

0.15.5-rc1:

```
ggml_backend_vk_device_get_memory called: uuid 00000000-0300-0000-0000-000000000000
```

Change: the method is now named `ggml_backend_vk_device_get_memory` (was `ggml_backend_vk_get_device_memory`).

0.15.4:

```
time=2026-02-04T08:47:10.272+01:00 level=DEBUG source=server.go:974 msg="available gpu" id=0 library=ROCm "available layer vram"="23.0 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="398.6 MiB"
time=2026-02-04T08:47:10.272+01:00 level=DEBUG source=server.go:974 msg="available gpu" id=868080a7-0400-0000-0002-000000000000 library=Vulkan "available layer vram"="63.5 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
```

0.15.5-rc1:

```
time=2026-02-04T08:49:53.981+01:00 level=DEBUG source=server.go:975 msg="available gpu" id=0 library=ROCm "available layer vram"="23.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="311.6 MiB"
time=2026-02-04T08:49:53.981+01:00 level=DEBUG source=server.go:975 msg="available gpu" id=00000000-0300-0000-0000-000000000000 library=Vulkan "available layer vram"="23.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-02-04T08:49:53.981+01:00 level=DEBUG source=server.go:975 msg="available gpu" id=868080a7-0400-0000-0002-000000000000 library=Vulkan "available layer vram"="63.5 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
```

Change: an additional Vulkan device is listed: `id=00000000-0300-0000-0000-000000000000 library=Vulkan "available layer vram"="23.1 GiB"`

  1. Vulkan uses this device, which is already supported by ROCm (so it isn't filtered out). Maybe the pci_id or something similar isn't returned anymore, so Ollama doesn't detect the duplicate.
  2. It shows 23.1 GiB available, but ROCm has already allocated memory on this device.

0.15.4:

```
time=2026-02-04T08:47:10.273+01:00 level=INFO source=device.go:262 msg="compute graph" device=ROCm0 size="398.6 MiB"
time=2026-02-04T08:47:10.273+01:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="4.0 MiB"
```

0.15.5-rc1:

```
time=2026-02-04T08:49:53.983+01:00 level=INFO source=device.go:262 msg="compute graph" device=ROCm0 size="311.6 MiB"
time=2026-02-04T08:49:53.983+01:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="175.6 MiB"
```

Change: the total compute graph size is higher, and much more of it is allocated on the CPU.

[Ollama0.15.4.log](https://github.com/user-attachments/files/25065743/Ollama0.15.4.log)
[Ollama0.15.5-rc1.log](https://github.com/user-attachments/files/25065744/Ollama0.15.5-rc1.log)
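(For comparing the two attached logs, a simple filter over the scheduler lines quoted above is enough; this assumes the files are downloaded with the names shown:)

```shell
# Pull out the GPU discovery and compute-graph placement lines from both logs.
for f in Ollama0.15.4.log Ollama0.15.5-rc1.log; do
  echo "== $f =="
  grep -E 'msg="available gpu"|msg="compute graph"|msg="total memory"' "$f"
done
```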


@inforithmics commented on GitHub (Feb 5, 2026):

I played around with the reverted vendor update on this branch: https://github.com/inforithmics/ollama/tree/MainBeforeRevert
The result is that glm-4.7-flash now runs fast again: about 100 tokens/s generation instead of 20 tokens/s.

Changes made:

- Vendor update to https://github.com/ggml-org/llama.cpp/pull/19324
- Added the top_k pull request: https://github.com/ollama/ollama/pull/13597

The compute graph memory on CPU is now 4 MB again, and the model also uses less VRAM on the GPU (22 GB with a 64000 context size).
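(For anyone wanting to reproduce this, a rough sketch of building and testing the branch; the exact build steps can differ by version, see the repo's docs/development.md for the authoritative instructions:)

```shell
# Check out the branch with the updated vendor code.
git clone https://github.com/inforithmics/ollama.git
cd ollama
git checkout MainBeforeRevert

# Build the native backends and start the server (per the developer docs).
cmake -B build && cmake --build build
go run . serve

# In another terminal, benchmark:
ollama run glm-4.7-flash --verbose "test prompt"
```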


Reference: github-starred/ollama#71237