[GH-ISSUE #11479] REGRESSION: v0.10.0-rc0 super slow and exhausting RAM on 32GB RAM CPU-only environment #33339

Closed
opened 2026-04-22 15:54:51 -05:00 by GiteaMirror · 13 comments

Originally created by @FieldMouse-AI on GitHub (Jul 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11479

What is the issue?

v0.10.0-rc0 is showing slowdowns and memory exhaustion similar to what I found occurring with v0.9.2.

Note that while v0.9.6 has occasional slowness issues of its own, it is relatively performant. So, for the time being, I am staying on this version.

I would like to add that whatever magic was done to make v0.9.6 such a standout performer, please bring it back! It was quite nice! 🤗

Anybody else experiencing this? 🤗

Is there anything you would like me to add? 🤗

🤔 UPDATE:

  1. To make sure that swap is not affecting performance, I turn off swap completely (sudo swapoff -a) and begin testing with about 14-15GB of the 32GB free (a quick verification sketch follows this list).
  2. The above allows for a total apples-to-apples comparison between runs.
  3. It also makes the RAM exhaustion more frightening as one watches it happening in real time.
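
A minimal sketch of that setup step (standard Linux commands; the free-memory figure is just what I see on my machine):

```shell
sudo swapoff -a   # disable all swap devices
swapon --show     # no output confirms swap is fully off
free -h           # verify ~14-15GB of the 32GB is free before starting a run
```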

🤯🤯 UPDATE 2 - Something is up, but might NOT be a regression!!!

See below in the comments for how things changed after further testing!

Relevant log output


OS

Linux

GPU

No response

CPU

AMD

Ollama version

v0.10.0-rc0

GiteaMirror added the bug label 2026-04-22 15:54:51 -05:00

@rick-github commented on GitHub (Jul 21, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
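
For a Docker-based setup like the one described later in this thread, capturing those logs is typically a one-liner (the container name `ollama` is an assumption):

```shell
docker logs ollama 2>&1 | tee ollama.log
```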


@rick-github commented on GitHub (Jul 22, 2025):

qwen3, "why is the sky blue?", parallel=1, num_gpu=0, otherwise defaults. Not much difference between 0.9.6 and 0.10.0-rc0 in terms of RSS and TPS.

[Image: RSS/TPS comparison between 0.9.6 and 0.10.0-rc0](https://github.com/user-attachments/assets/edab7aad-4e70-4c66-bf9c-52baa94285f1)

@FieldMouse-AI commented on GitHub (Jul 22, 2025):

qwen3, "why is the sky blue?", parallel=1, num_gpu=0, otherwise defaults. Not much difference between 0.9.6 and 0.10.0-rc0 in terms of RSS and TPS.

Image

@rick-github , thank you for your reply!

I have an UPDATE and an OBSERVATION that might direct us towards a solution!!!

Let me share with you my configuration:

  • CPU: AMD Ryzen 7 5800U
  • RAM: 32GB (with plenty free)
  • OS/Kernel: Ubuntu Linux 22.04.5 LTS, 6.8.0-64-generic (Host and Docker container)
  • Ollama environment: Docker container, with OLLAMA_LLM_DEVICE=CPU explicitly and consistently set.

First, after testing 0.9.2, then testing 0.10.0-rc0, and finding them both slower than my 0.9.6 environment, I decided to reinstall 0.9.6 from the GitHub release in the same way that I installed 0.9.2 and 0.10.0-rc0.

That version of 0.9.6 that I downloaded from GitHub had the exact same slow performance profile as 0.9.2 and 0.10.0-rc0! 🤯

Then I rebuilt the Ollama docker container using ollama.com's curl -fsSL https://ollama.com/install.sh | sh installer for Linux.
Currently, this is installing 0.9.6 as per ollama --version.

But when I ran my tests, it was back to being speedy again!

I am guessing that the GitHub build of 0.9.6 (and perhaps of other versions) produces a different build result than what is distributed directly by ollama.com's curl -fsSL https://ollama.com/install.sh | sh installer for Linux. Maybe it is missing special optimizations (like AVX2/FMA specifically for my Ryzen CPU) or incorporates a less optimized llama.cpp build compared to the official install.sh.
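
For reference, checking which of those SIMD features the CPU actually exposes is straightforward (standard Linux, nothing Ollama-specific):

```shell
grep -o -w -E 'avx2|fma|f16c' /proc/cpuinfo | sort -u   # a Ryzen 7 5800U should report all three
```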

Concrete Performance Data (Before & After) using only Ollama 0.9.6. The 1b model is llama3.2:1b:

  • Original (install.sh) (Good) Performance (1b model): ~1m41s - 2m inference times.
  • Degraded (GitHub download) (Bad) Performance (1b model): ~5m inference times.
  • Restored (install.sh) (Good) performance (1b model): cold start ~1m4s, warm start ~1m19s.

PS:

When I tried testing the different versions of Ollama, I used the following in my Dockerfile to select the version. This approach is what gave me the slow version of the ollama binary. Of course, I would change OLLAMA_VERSION to whatever version I wanted installed for testing.

RUN mkdir -p /tmp/ollama_install && \
    curl -L "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tgz" -o /tmp/ollama_install/ollama-linux-amd64.tgz && \
    tar -xzf /tmp/ollama_install/ollama-linux-amd64.tgz -C /tmp/ollama_install && \
    mv /tmp/ollama_install/bin/ollama /usr/local/bin/ollama && \
    chmod +x /usr/local/bin/ollama && \
    rm -rf /tmp/ollama_install

Replacing the above with the following is what gave me the speedy version of the binary:

RUN curl -fsSL https://ollama.com/install.sh | sh

So, these are my findings.

There was no regression in the usual sense, I believe.
However, I do believe there is evidence that the builds differ, with the most optimized builds available via ollama.com's curl -fsSL https://ollama.com/install.sh | sh installer for Linux.

What do you think? 🤗


@rick-github commented on GitHub (Jul 22, 2025):

OLLAMA_LLM_DEVICE is not an Ollama configuration variable.

The slowness is due to your Franken-container. The correct way to update the ollama image is to pull the new version:

docker pull ollama/ollama:0.9.6

By downloading the tar.gz inside the container and mving the binary to /usr/local/bin/ollama you are destroying the CPU and GPU backends that are required for fast inference.
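
The release tarball layout makes this visible (illustrative listing; exact file names vary by release):

```shell
tar -tzf ollama-linux-amd64.tgz | head
# bin/ollama
# lib/ollama/...   <- the backend libraries that must ship alongside the binary
```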


@rick-github commented on GitHub (Jul 23, 2025):

I realized you might be installing ollama inside a custom container rather than updating ollama, which is what I took away from "rebuilt the Ollama docker container". In that case, you want to modify the install process:

# extracting the whole archive into /usr/local keeps bin/ollama AND lib/ollama (the backends) together
RUN mkdir -p /tmp/ollama_install && \
    curl -L "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tgz" -o /tmp/ollama_install/ollama-linux-amd64.tgz && \
    tar -xzf /tmp/ollama_install/ollama-linux-amd64.tgz -C /usr/local && \
    rm -rf /tmp/ollama_install
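
A quick post-build check that the backends survived (the lib/ollama path assumes the current release layout):

```shell
ls /usr/local/lib/ollama   # should list the bundled CPU/GPU backend libraries
```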

@FieldMouse-AI commented on GitHub (Jul 23, 2025):

> I realized you might be installing ollama inside a custom container rather than updating ollama, which is what I took away from "rebuilt the Ollama docker container". In that case, you want to modify the install process:
>
> ```shell
> RUN mkdir -p /tmp/ollama_install && \
>     curl -L "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tgz" -o /tmp/ollama_install/ollama-linux-amd64.tgz && \
>     tar -xzf /tmp/ollama_install/ollama-linux-amd64.tgz -C /usr/local && \
>     rm -rf /tmp/ollama_install
> ```

🤗 Ooo! I see what you did there!

I get to keep using my Dockerfile for Ollama and I get all of the optimized goodness with it.

I will give this a try!

Thanks! 😊


@FieldMouse-AI commented on GitHub (Jul 23, 2025):

> > I realized you might be installing ollama inside a custom container rather than updating ollama, which is what I took away from "rebuilt the Ollama docker container". In that case, you want to modify the install process:
> >
> > ```shell
> > RUN mkdir -p /tmp/ollama_install && \
> >     curl -L "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tgz" -o /tmp/ollama_install/ollama-linux-amd64.tgz && \
> >     tar -xzf /tmp/ollama_install/ollama-linux-amd64.tgz -C /usr/local && \
> >     rm -rf /tmp/ollama_install
> > ```
>
> 🤗 Ooo! I see what you did there!
>
> I get to keep using my Dockerfile for Ollama and I get all of the optimized goodness with it.
>
> I will give this a try!
>
> Thanks! 😊

@rick-github , Great news!!!!!!!!!

Your solution to help get my Dockerfile's RUN statement to work was 1000% spot on perfect!!!

I am now testing with 0.9.6 and I will move on to test 0.9.2 and 0.10.0-rc0 as soon as time permits!

Thanks! I will post back here with the results of the other tests!

Thanks!
🤗🤗🤗


@FieldMouse-AI commented on GitHub (Jul 23, 2025):

> > 🤗 Ooo! I see what you did there!
> > I get to keep using my Dockerfile for Ollama and I get all of the optimized goodness with it.
> > I will give this a try!
> > Thanks! 😊
>
> @rick-github , Great news!!!!!!!!!
>
> Your solution to help get my Dockerfile's RUN statement to work was 1000% spot on perfect!!!
>
> I am now testing with 0.9.6 and I will move on to test 0.9.2 and 0.10.0-rc0 as soon as time permits!
>
> Thanks! I will post back here with the results of the other tests!
>
> Thanks! 🤗🤗🤗

Hello, again, @rick-github ! As promised, I've returned with the results of my testing.

For all of my tests I feed about 10,000 tokens worth of text to a 16,384 num_ctx instance of llama3.2:1b. I run it 3 times and take the average (a minimal timing harness is sketched after the results below). Please note that I could not test 0.10.0-rc0 as it was no longer available, so I switched to testing 0.10.0-rc1 instead.

  • 0.9.2: inference time: 2m10s 👈 Speedier than 0.9.6?
  • 0.9.6: inference time: 2m30s 👈 Regression?
  • 0.10.0-rc1: inference time: 2m45s 👈 Regression?
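
Roughly, each timed run boils down to something like this (the prompt file name is illustrative, and jq is used only to JSON-escape it; /api/generate and its options are the standard Ollama API):

```shell
# time one inference with a ~10k-token prompt at num_ctx=16384
time curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"llama3.2:1b\",
  \"prompt\": $(jq -Rs . < prompt-10k-tokens.txt),
  \"stream\": false,
  \"options\": { \"num_ctx\": 16384 }
}" > /dev/null
```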

Based on these results, it would seem that 0.9.2 is the most performant of these 3. So, given that I can now switch to and lock down particular versions of Ollama as I need (thanks for the Dockerfile fix, @rick-github), it would seem best for me to stick with 0.9.2 until a more performant release becomes available.

What do you think??? 🤗


@rick-github commented on GitHub (Jul 23, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@FieldMouse-AI commented on GitHub (Jul 24, 2025):

> Server logs may aid in debugging.

Hello, @rick-github .

I didn't want to make you have to wait too long for a response as I did my testing.

As it turns out, when I tested the models more strenuously (short times between runs, my usual mode) as well as more lightly (long times between runs, e.g. I would do a run, then make coffee, then do another run and go make breakfast, etc.), I noticed something interesting: the time results for 0.9.2 and 0.9.6 started to become more similar. 🤔

As an example, under low pressure (tests run just now during breakfast preparation), 0.9.6 was getting results like 2m20s and 2m36s. 🤔

And under heavy pressure, both 0.9.2 and 0.9.6 started rising into the 2m50s to 3m range. 🤯

Now, I'm an old LISP language designer type, so I've seen memory management problems like this before: if I did not give the system enough time for the garbage collector to at least reorganize allocations, it would crawl at runtime because the reorganization still had to happen during execution.

IMHO, this suggests that what is afoot here is the memory manager of my OS (Ubuntu Linux 22.04.5 LTS) quietly cleaning things up while I was stirring cream-of-wheat.

This is just a guess on my part, but my hunch is that redoing the trials more thoroughly would likely bear something like that out.
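
If that hypothesis is right, kernel reclaim activity should be visible between runs; one minimal way to watch for it (standard Linux tools, nothing Ollama-specific):

```shell
vmstat -w 5                  # watch free/buff/cache evolve between runs
cat /proc/pressure/memory    # PSI memory-pressure counters (kernel >= 4.20)
```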

Doing this test will take longer, as I would have to do a bunch of back-to-back runs followed by, perhaps, a reboot, then a bunch of long-wait-between-runs runs.

Oh, and I will have to include logs with the runs.

So, that's the heads-up. 🤗


@FieldMouse-AI commented on GitHub (Jul 27, 2025):

😊 Hello, @rick-github , as promised, I reran all of my tests on the proper fully optimized installations of Ollama for the following versions:

  • 0.9.2
  • 0.9.6
  • 0.10.0-rc2 (0.10.0-rc1 was not available anymore, so I decided to just do 0.10.0-rc2 instead)

Along with the charts below, I have also attached the logs for each set of runs that was performed. Please see the attachments.

My overall impression is that the proper installations are clearly faster than the non-optimized versions I had originally tested, on all counts.

I am curious to know what you might discover from these results. 🤗

[quick-0.9.2-ollama.log](https://github.com/user-attachments/files/21453154/quick-0.9.2-ollama.log)
[quick-0.9.6-ollama.log](https://github.com/user-attachments/files/21453149/quick-0.9.6-ollama.log)
[quick-0.10.0-rc2-ollama.log](https://github.com/user-attachments/files/21453150/quick-0.10.0-rc2-ollama.log)
[slow-0.9.2-ollama.log](https://github.com/user-attachments/files/21453151/slow-0.9.2-ollama.log)
[slow-0.9.6-ollama.log](https://github.com/user-attachments/files/21453153/slow-0.9.6-ollama.log)
[slow-0.10.0-rc2-ollama.log](https://github.com/user-attachments/files/21453152/slow-0.10.0-rc2-ollama.log)

About the test environment

Host

  • OS: Ubuntu 22.04.5 LTS x86_64
  • Kernel: 6.8.0-64-generic
  • CPU: AMD Ryzen 7 5800U with Radeon Graphics (8-cores/16-threads) @ 4.500GHz
  • RAM: 32GB
  • Swap: 0GB (turned off using sudo swapoff -a)

Ollama Server

  • Runs in its own Docker container
    • OS: Ubuntu 22.04.5 LTS x86_64
  • Model tested: llama3.2:1b-instruct-q8_0
  • OLLAMA_DEBUG=1
  • OLLAMA_KEEP_ALIVE=-1
  • OLLAMA_NUM_PARALLEL=1
  • OLLAMA_KV_CACHE_TYPE=q8_0
  • OLLAMA_CONTEXT_LENGTH=131072
  • OLLAMA_LLM_DEVICE=CPU
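
Collected into a single launch command, the server settings above correspond roughly to this (the image tag and port mapping are illustrative; note that OLLAMA_LLM_DEVICE was flagged earlier in the thread as not an Ollama variable, and is kept only for parity with my setup):

```shell
docker run -d --name ollama -p 11434:11434 \
  -e OLLAMA_DEBUG=1 \
  -e OLLAMA_KEEP_ALIVE=-1 \
  -e OLLAMA_NUM_PARALLEL=1 \
  -e OLLAMA_KV_CACHE_TYPE=q8_0 \
  -e OLLAMA_CONTEXT_LENGTH=131072 \
  -e OLLAMA_LLM_DEVICE=CPU \
  my-ollama-image:0.10.0-rc2
```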

Quick Turnaround Runs

These are runs where each run of my workflow was executed as close to back-to-back as possible.

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.9.2 | cold start | 28,045 | 9,348 | 1,000 | 10,348 | 16,384 | 3:29 |
|  | warm start | 28,301 | 9,433 | 1,000 | 10,433 | 16,384 | 3:26 |
|  | warm start | 27,790 | 9,263 | 1,000 | 10,263 | 16,384 | 3:24 |
|  | warm start | 26,786 | 8,928 | 1,000 | 9,928 | 16,384 | 2:56 |
|  |  |  |  |  |  | warm start average | 3:15 |

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.9.6 | cold start | 27,074 | 9,024 | 1,000 | 10,024 | 16,384 | 2:43 |
|  | warm start | 28,662 | 9,554 | 1,000 | 10,554 | 16,384 | 2:53 |
|  | warm start | 27,853 | 9,284 | 1,000 | 10,284 | 16,384 | 3:05 |
|  | warm start | 28,006 | 9,335 | 1,000 | 10,335 | 16,384 | 3:30 |
|  |  |  |  |  |  | warm start average | 3:09 |

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.10.0-rc2 | cold start | 27,336 | 9,112 | 1,000 | 10,112 | 16,384 | 2:58 |
|  | warm start | 27,833 | 9,277 | 1,000 | 10,277 | 16,384 | 3:20 |
|  | warm start | 27,883 | 9,294 | 1,000 | 10,294 | 16,384 | 3:21 |
|  | warm start | 27,995 | 9,331 | 1,000 | 10,331 | 16,384 | 3:30 |
|  |  |  |  |  |  | warm start average | 3:24 |

Slow Turnaround Runs

These are runs where the time between runs is close to the time it takes to make breakfast -- for me, about 10 minutes. The idea was to give the OS time to reorganize memory while not under pressure, if it so wanted.

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.9.2 | cold start | 27,897 | 9,299 | 1,000 | 10,299 | 16,384 | 3:05 |
|  | warm start | 28,443 | 9,481 | 1,000 | 10,481 | 16,384 | 3:05 |
|  | warm start | 28,883 | 9,627 | 1,000 | 10,627 | 16,384 | 3:16 |
|  | warm start | 28,431 | 9,477 | 1,000 | 10,477 | 16,384 | 3:40 |
|  |  |  |  |  |  | warm start average | 3:20 |

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.9.6 | cold start | 29,049 | 9,683 | 1,000 | 10,683 | 16,384 | 4:21 |
|  | warm start | 28,832 | 9,610 | 1,000 | 10,610 | 16,384 | 3:45 |
|  | warm start | 27,993 | 9,331 | 1,000 | 10,331 | 16,384 | 3:04 |
|  | warm start | 28,003 | 9,334 | 1,000 | 10,334 | 16,384 | 3:15 |
|  |  |  |  |  |  | warm start average | 3:21 |

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.10.0-rc2 | cold start | 29,769 | 9,923 | 1,000 | 10,923 | 16,384 | 3:38 |
|  | warm start | 29,858 | 9,952 | 1,000 | 10,952 | 16,384 | 3:52 |
|  | warm start | 32,447 | 10,815 | 1,000 | 11,815 | 16,384 | 4:00 |
|  | warm start | 29,693 | 9,897 | 1,000 | 10,897 | 16,384 | 3:14 |
|  |  |  |  |  |  | warm start average | 3:42 |

@FieldMouse-AI commented on GitHub (Jul 30, 2025):

🤗 Hello, @rick-github , I noticed a few hours ago that version 0.10.0 was just released.

So, I ran only WARM START runs since I had some time. Cold start runs imply that I rebooted my system -- and sorry, I just didn't have the time for a reboot. 🙇

The first thing that I noticed is that just by eyeballing it, the new version appears to be using less system RAM for the models. I have no good measures as I did not measure this before, but I noticed that the RAM used/available appears lower than before.

Next, the warm start performance appears to be not just better, but much better.

Please note that I did increase num_predict from 1,000 in the previous runs to 4,000. This was because I updated my application to produce longer responses, though the responses always came in under 1,000 tokens.

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.10.0 | warm start | 16,093 | 5,364 | 4,000 | 9,364 | 16,384 | 1:07 🤯 |
|  | warm start | 16,094 | 5,364 | 4,000 | 9,364 | 16,384 | 0:13 🤯🤯 |
|  | warm start | 16,094 | 5,364 | 4,000 | 9,364 | 16,384 | 0:14 🤯🤯 |
|  | warm start | 20,396 | 6,798 | 4,000 | 10,798 | 16,384 | 1:44 🤯 |

(Sorry, I did not compute an average this time because the timings are pretty far apart, so I felt that an average would not produce a representative value.)

First, 🤯!!!! The speed improvements are not just much better, but unexpectedly astounding!!!

Even the two slow runs are at least twice as fast as my previous fastest runs. Even though I am running with swapoff -a, it is possible that a model got unloaded along the way and needed to be reloaded, so I may be paying a model reload penalty.

🤔 I am starting to think that if I had enough RAM to keep all of my models loaded without unload/reload events, I would be getting consistent sub-20-second inferences with my test.

For the record: My previous tests were run after a reboot with only Chrome and some terminals open.

UPDATE: I found that I also had 2 terminals open with ollama run sessions from other tests, which were taking up as much as 12-14GB!! So memory pressure from the unloading and reloading of models may well be what caused my slower runs. Still, it's only my hunch.
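
A cheap way to check that unload/reload hunch on future runs is the standard CLI status command:

```shell
ollama ps   # lists loaded models, their memory footprint, and when they expire
```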

This time, I happen to be doing other work, so I have a lot of browser windows open along with the GUI-based DBeaver SQL tool. So memory, while available, is still more constrained than in my previous tests.

These are just my armchair, back of the napkin timings, but OMG, I am quite happy. This is fast. 🤯

@rick-github , I am really curious about any comments that you could offer.

Thanks to you and the crew, @rick-github ! 🤗


@FieldMouse-AI commented on GitHub (Aug 4, 2025):

@rick-github , I have to admit that after further testing, things are quite quick now.

If you are fine with this, I will be happy to close this issue.
