[GH-ISSUE #11479] REGRESSION: v0.10.0-rc0 super slow and exhausting RAM on 32GB RAM CPU-only environment #33339

Closed
opened 2026-04-22 15:54:51 -05:00 by GiteaMirror · 13 comments

Originally created by @FieldMouse-AI on GitHub (Jul 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11479

What is the issue?

v0.10.0-rc0 is showing slowdowns and memory exhaustion similar to what I found occurring with v0.9.2.

Note that while v0.9.6 has occasional slowness issues of its own, it is relatively performant. So, for the time being, I am staying on this version.

I would like to add that whatever magic was done to make v0.9.6 such a standout performer, please bring it back! It was quite nice! 🤗

Anybody else experiencing this? 🤗

Is there anything you would like me to add? 🤗

🤔 UPDATE:

  1. To make sure that swap is not affecting performance, I turn off swap completely (sudo swapoff -a) and begin testing with about 14-15GB of the 32GB free (a quick verification sketch follows this list).
  2. The above allows for a total apples-to-apples comparison between runs.
  3. It also makes the RAM exhaustion more frightening as one watches it happening in real time.
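
A minimal sketch of that setup step (standard Linux commands; the free-memory figure is just what I see on my machine):

```shell
sudo swapoff -a   # disable all swap devices
swapon --show     # no output confirms swap is fully off
free -h           # verify ~14-15GB of the 32GB is free before starting a run
```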

🤯🤯 UPDATE 2 - Something is up, but might NOT be a regression!!!

See below in the comments for how things changed after further testing!

Relevant log output


OS

Linux

GPU

No response

CPU

AMD

Ollama version

v0.10.0-rc0

GiteaMirror added the bug label 2026-04-22 15:54:51 -05:00

@rick-github commented on GitHub (Jul 21, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
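
For a Docker-based setup like the one described later in this thread, capturing those logs is typically a one-liner (the container name `ollama` is an assumption):

```shell
docker logs ollama 2>&1 | tee ollama.log
```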


@rick-github commented on GitHub (Jul 22, 2025):

qwen3, "why is the sky blue?", parallel=1, num_gpu=0, otherwise defaults. Not much difference between 0.9.6 and 0.10.0-rc0 in terms of RSS and TPS.

[Image: RSS/TPS comparison between 0.9.6 and 0.10.0-rc0](https://github.com/user-attachments/assets/edab7aad-4e70-4c66-bf9c-52baa94285f1)

@FieldMouse-AI commented on GitHub (Jul 22, 2025):

qwen3, "why is the sky blue?", parallel=1, num_gpu=0, otherwise defaults. Not much difference between 0.9.6 and 0.10.0-rc0 in terms of RSS and TPS.

Image

@rick-github , thank you for your reply!

I have an UPDATE and an OBSERVATION that might direct us towards a solution!!!

Let me share with you my configuration:

  • CPU: AMD Ryzen 7 5800U
  • RAM: 32GB (with plenty free)
  • OS/Kernel: Ubuntu Linux 22.04.5 LTS, 6.8.0-64-generic (Host and Docker container)
  • Ollama environment: Docker container, with OLLAMA_LLM_DEVICE=CPU explicitly and consistently set.

First, after testing 0.9.2, then testing 0.10.0-rc0, and finding them both slower than my 0.9.6 environment, I decided to reinstall 0.9.6 from the GitHub release in the same way that I installed 0.9.2 and 0.10.0-rc0.

That version of 0.9.6 that I downloaded from GitHub had the exact same slow performance profile as 0.9.2 and 0.10.0-rc0! 🤯

Then I rebuilt the Ollama docker container using ollama.com's curl -fsSL https://ollama.com/install.sh | sh installer for Linux.
Currently, this is installing 0.9.6 as per ollama --version.

But when I ran my tests, it was back to being speedy again!

I am guessing that the GitHub build of 0.9.6 (and perhaps of other versions) produces a different build result than what is distributed directly by ollama.com's curl -fsSL https://ollama.com/install.sh | sh installer for Linux. Maybe it is missing special optimizations (like AVX2/FMA specifically for my Ryzen CPU) or incorporates a less optimized llama.cpp build compared to the official install.sh.
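
For reference, checking which of those SIMD features the CPU actually exposes is straightforward (standard Linux, nothing Ollama-specific):

```shell
grep -o -w -E 'avx2|fma|f16c' /proc/cpuinfo | sort -u   # a Ryzen 7 5800U should report all three
```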

Concrete Performance Data (Before & After) using only Ollama 0.9.6. The 1b model is llama3.2:1b:

  • Original (install.sh) (Good) Performance (1b model): ~1m41s - 2m inference times.
  • Degraded (GitHub download) (Bad) Performance (1b model): ~5m inference times.
  • Restored (install.sh) (Good) performance (1b model): cold start ~1m4s, warm start ~1m19s.

PS:

When I tried testing the different versions of Ollama, I used the following in my Dockerfile to select the version. This approach is what gave me the slow version of the ollama binary. Of course, I would change OLLAMA_VERSION to whatever version I wanted installed for testing.

RUN mkdir -p /tmp/ollama_install && \
    curl -L "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tgz" -o /tmp/ollama_install/ollama-linux-amd64.tgz && \
    tar -xzf /tmp/ollama_install/ollama-linux-amd64.tgz -C /tmp/ollama_install && \
    mv /tmp/ollama_install/bin/ollama /usr/local/bin/ollama && \
    chmod +x /usr/local/bin/ollama && \
    rm -rf /tmp/ollama_install

Replacing the above with the following is what gave me the speedy version of the binary:

RUN curl -fsSL https://ollama.com/install.sh | sh

So, these are my findings.

There was no regression in the usual sense, I believe.
However, I do believe there is evidence that the builds differ, with the most optimized builds available via ollama.com's curl -fsSL https://ollama.com/install.sh | sh installer for Linux.

What do you think? 🤗


@rick-github commented on GitHub (Jul 22, 2025):

OLLAMA_LLM_DEVICE is not an Ollama configuration variable.

The slowness is due to your Franken-container. The correct way to update the ollama image is to pull the new version:

docker pull ollama/ollama:0.9.6

By downloading the tar.gz inside the container and mving the binary to /usr/local/bin/ollama you are destroying the CPU and GPU backends that are required for fast inference.
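
The release tarball layout makes this visible (illustrative listing; exact file names vary by release):

```shell
tar -tzf ollama-linux-amd64.tgz | head
# bin/ollama
# lib/ollama/...   <- the backend libraries that must ship alongside the binary
```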


@rick-github commented on GitHub (Jul 23, 2025):

I realized you might be installing ollama inside a custom container rather than updating ollama, which is what I took away from "rebuilt the Ollama docker container". In that case, you want to modify the install process:

# extracting the whole archive into /usr/local keeps bin/ollama AND lib/ollama (the backends) together
RUN mkdir -p /tmp/ollama_install && \
    curl -L "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tgz" -o /tmp/ollama_install/ollama-linux-amd64.tgz && \
    tar -xzf /tmp/ollama_install/ollama-linux-amd64.tgz -C /usr/local && \
    rm -rf /tmp/ollama_install
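
A quick post-build check that the backends survived (the lib/ollama path assumes the current release layout):

```shell
ls /usr/local/lib/ollama   # should list the bundled CPU/GPU backend libraries
```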

@FieldMouse-AI commented on GitHub (Jul 23, 2025):

> I realized you might be installing ollama inside a custom container rather than updating ollama, which is what I took away from "rebuilt the Ollama docker container". In that case, you want to modify the install process:
>
> ```shell
> RUN mkdir -p /tmp/ollama_install && \
>     curl -L "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tgz" -o /tmp/ollama_install/ollama-linux-amd64.tgz && \
>     tar -xzf /tmp/ollama_install/ollama-linux-amd64.tgz -C /usr/local && \
>     rm -rf /tmp/ollama_install
> ```

🤗 Ooo! I see what you did there!

I get to keep using my Dockerfile for Ollama and I get all of the optimized goodness with it.

I will give this a try!

Thanks! 😊


@FieldMouse-AI commented on GitHub (Jul 23, 2025):

> > I realized you might be installing ollama inside a custom container rather than updating ollama, which is what I took away from "rebuilt the Ollama docker container". In that case, you want to modify the install process:
> >
> > ```shell
> > RUN mkdir -p /tmp/ollama_install && \
> >     curl -L "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.tgz" -o /tmp/ollama_install/ollama-linux-amd64.tgz && \
> >     tar -xzf /tmp/ollama_install/ollama-linux-amd64.tgz -C /usr/local && \
> >     rm -rf /tmp/ollama_install
> > ```
>
> 🤗 Ooo! I see what you did there!
>
> I get to keep using my Dockerfile for Ollama and I get all of the optimized goodness with it.
>
> I will give this a try!
>
> Thanks! 😊

@rick-github , Great news!!!!!!!!!

Your solution to help get my Dockerfile's RUN statement to work was 1000% spot on perfect!!!

I am now testing with 0.9.6 and I will move on to test 0.9.2 and 0.10.0-rc0 as soon as time permits!

Thanks! I will post back here with the results of the other tests!

Thanks!
🤗🤗🤗


@FieldMouse-AI commented on GitHub (Jul 23, 2025):

> > 🤗 Ooo! I see what you did there!
> > I get to keep using my Dockerfile for Ollama and I get all of the optimized goodness with it.
> > I will give this a try!
> > Thanks! 😊
>
> @rick-github , Great news!!!!!!!!!
>
> Your solution to help get my Dockerfile's RUN statement to work was 1000% spot on perfect!!!
>
> I am now testing with 0.9.6 and I will move on to test 0.9.2 and 0.10.0-rc0 as soon as time permits!
>
> Thanks! I will post back here with the results of the other tests!
>
> Thanks! 🤗🤗🤗

Hello, again, @rick-github ! As promised, I've returned with the results of my testing.

For all of my tests I feed about 10,000 tokens worth of text to a 16,384 num_ctx instance of llama3.2:1b. I run it 3 times and take the average (a minimal timing harness is sketched after the results below). Please note that I could not test 0.10.0-rc0 as it was no longer available, so I switched to testing 0.10.0-rc1 instead.

  • 0.9.2: inference time: 2m10s 👈 Speedier than 0.9.6?
  • 0.9.6: inference time: 2m30s 👈 Regression?
  • 0.10.0-rc1: inference time: 2m45s 👈 Regression?
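
Roughly, each timed run boils down to something like this (the prompt file name is illustrative, and jq is used only to JSON-escape it; /api/generate and its options are the standard Ollama API):

```shell
# time one inference with a ~10k-token prompt at num_ctx=16384
time curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"llama3.2:1b\",
  \"prompt\": $(jq -Rs . < prompt-10k-tokens.txt),
  \"stream\": false,
  \"options\": { \"num_ctx\": 16384 }
}" > /dev/null
```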

Based on these results, it would seem that 0.9.2 is the most performant of these 3. So, given that I can now switch to and lock down particular versions of Ollama as I need (thanks for the Dockerfile fix, @rick-github), it would seem best for me to stick with 0.9.2 until a more performant release becomes available.

What do you think??? 🤗


@rick-github commented on GitHub (Jul 23, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@FieldMouse-AI commented on GitHub (Jul 24, 2025):

> Server logs may aid in debugging.

Hello, @rick-github .

I didn't want to make you have to wait too long for a response as I did my testing.

As it turns out, when I tested the models more strenuously (short times between runs, my usual mode) as well as more lightly (long times between runs, e.g. I would do a run, then make coffee, then do another run and go make breakfast, etc.), I noticed something interesting: the time results for 0.9.2 and 0.9.6 started to become more similar. 🤔

As an example, under low pressure (tests run just now during breakfast preparation), 0.9.6 was getting results like 2m20s and 2m36s. 🤔

And under heavy pressure, both 0.9.2 and 0.9.6 started rising into the 2m50s to 3m range. 🤯

Now, I'm an old LISP language designer type, so I've seen memory management problems like this before: if I did not give the system enough time for the garbage collector to at least reorganize allocations, it would crawl at runtime because the reorganization still had to happen during execution.

IMHO, this suggests that what is afoot here is the memory manager of my OS (Ubuntu Linux 22.04.5 LTS) quietly cleaning things up while I was stirring cream-of-wheat.

This is just a guess on my part, but my hunch is that redoing the trials more thoroughly would likely bear something like that out.
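
If that hypothesis is right, kernel reclaim activity should be visible between runs; one minimal way to watch for it (standard Linux tools, nothing Ollama-specific):

```shell
vmstat -w 5                  # watch free/buff/cache evolve between runs
cat /proc/pressure/memory    # PSI memory-pressure counters (kernel >= 4.20)
```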

Doing this test will take longer, as I would have to do a bunch of back-to-back runs followed by, perhaps, a reboot, then a bunch of long-wait-between-runs runs.

Oh, and I will have to include logs with the runs.

So, that's the heads-up. 🤗


@FieldMouse-AI commented on GitHub (Jul 27, 2025):

😊 Hello, @rick-github , as promised, I reran all of my tests on the proper fully optimized installations of Ollama for the following versions:

  • 0.9.2
  • 0.9.6
  • 0.10.0-rc2 (0.10.0-rc1 was not available anymore, so I decided to just do 0.10.0-rc2 instead)

Along with the charts below, I have also attached the logs for each set of runs that was performed. Please see the attachments.

My overall impression is that the proper installations are clearly faster than the non-optimized versions I had originally tested, on all counts.

I am curious to know what you might discover from these results. 🤗

[quick-0.9.2-ollama.log](https://github.com/user-attachments/files/21453154/quick-0.9.2-ollama.log)
[quick-0.9.6-ollama.log](https://github.com/user-attachments/files/21453149/quick-0.9.6-ollama.log)
[quick-0.10.0-rc2-ollama.log](https://github.com/user-attachments/files/21453150/quick-0.10.0-rc2-ollama.log)
[slow-0.9.2-ollama.log](https://github.com/user-attachments/files/21453151/slow-0.9.2-ollama.log)
[slow-0.9.6-ollama.log](https://github.com/user-attachments/files/21453153/slow-0.9.6-ollama.log)
[slow-0.10.0-rc2-ollama.log](https://github.com/user-attachments/files/21453152/slow-0.10.0-rc2-ollama.log)

About the test environment

Host

  • OS: Ubuntu 22.04.5 LTS x86_64
  • Kernel: 6.8.0-64-generic
  • CPU: AMD Ryzen 7 5800U with Radeon Graphics (8-cores/16-threads) @ 4.500GHz
  • RAM: 32GB
  • Swap: 0GB (turned off using sudo swapoff -a)

Ollama Server

  • Runs in its own Docker container
    • OS: Ubuntu 22.04.5 LTS x86_64
  • Model tested: llama3.2:1b-instruct-q8_0
  • OLLAMA_DEBUG=1
  • OLLAMA_KEEP_ALIVE=-1
  • OLLAMA_NUM_PARALLEL=1
  • OLLAMA_KV_CACHE_TYPE=q8_0
  • OLLAMA_CONTEXT_LENGTH=131072
  • OLLAMA_LLM_DEVICE=CPU
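
Collected into a single launch command, the server settings above correspond roughly to this (the image tag and port mapping are illustrative; note that OLLAMA_LLM_DEVICE was flagged earlier in the thread as not an Ollama variable, and is kept only for parity with my setup):

```shell
docker run -d --name ollama -p 11434:11434 \
  -e OLLAMA_DEBUG=1 \
  -e OLLAMA_KEEP_ALIVE=-1 \
  -e OLLAMA_NUM_PARALLEL=1 \
  -e OLLAMA_KV_CACHE_TYPE=q8_0 \
  -e OLLAMA_CONTEXT_LENGTH=131072 \
  -e OLLAMA_LLM_DEVICE=CPU \
  my-ollama-image:0.10.0-rc2
```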

Quick Turnaround Runs

These are runs where each run of my workflow was executed as close to back-to-back as possible.

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.9.2 | cold start | 28,045 | 9,348 | 1,000 | 10,348 | 16,384 | 3:29 |
|  | warm start | 28,301 | 9,433 | 1,000 | 10,433 | 16,384 | 3:26 |
|  | warm start | 27,790 | 9,263 | 1,000 | 10,263 | 16,384 | 3:24 |
|  | warm start | 26,786 | 8,928 | 1,000 | 9,928 | 16,384 | 2:56 |
|  |  |  |  |  |  | warm start average | 3:15 |

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.9.6 | cold start | 27,074 | 9,024 | 1,000 | 10,024 | 16,384 | 2:43 |
|  | warm start | 28,662 | 9,554 | 1,000 | 10,554 | 16,384 | 2:53 |
|  | warm start | 27,853 | 9,284 | 1,000 | 10,284 | 16,384 | 3:05 |
|  | warm start | 28,006 | 9,335 | 1,000 | 10,335 | 16,384 | 3:30 |
|  |  |  |  |  |  | warm start average | 3:09 |

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.10.0-rc2 | cold start | 27,336 | 9,112 | 1,000 | 10,112 | 16,384 | 2:58 |
|  | warm start | 27,833 | 9,277 | 1,000 | 10,277 | 16,384 | 3:20 |
|  | warm start | 27,883 | 9,294 | 1,000 | 10,294 | 16,384 | 3:21 |
|  | warm start | 27,995 | 9,331 | 1,000 | 10,331 | 16,384 | 3:30 |
|  |  |  |  |  |  | warm start average | 3:24 |

Slow Turnaround Runs

These are runs where the time between runs is close to the time it takes to make breakfast -- for me, about 10 minutes. The idea was to give the OS time to reorganize memory while not under pressure, if it so wanted.

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.9.2 | cold start | 27,897 | 9,299 | 1,000 | 10,299 | 16,384 | 3:05 |
|  | warm start | 28,443 | 9,481 | 1,000 | 10,481 | 16,384 | 3:05 |
|  | warm start | 28,883 | 9,627 | 1,000 | 10,627 | 16,384 | 3:16 |
|  | warm start | 28,431 | 9,477 | 1,000 | 10,477 | 16,384 | 3:40 |
|  |  |  |  |  |  | warm start average | 3:20 |

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.9.6 | cold start | 29,049 | 9,683 | 1,000 | 10,683 | 16,384 | 4:21 |
|  | warm start | 28,832 | 9,610 | 1,000 | 10,610 | 16,384 | 3:45 |
|  | warm start | 27,993 | 9,331 | 1,000 | 10,331 | 16,384 | 3:04 |
|  | warm start | 28,003 | 9,334 | 1,000 | 10,334 | 16,384 | 3:15 |
|  |  |  |  |  |  | warm start average | 3:21 |

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.10.0-rc2 | cold start | 29,769 | 9,923 | 1,000 | 10,923 | 16,384 | 3:38 |
|  | warm start | 29,858 | 9,952 | 1,000 | 10,952 | 16,384 | 3:52 |
|  | warm start | 32,447 | 10,815 | 1,000 | 11,815 | 16,384 | 4:00 |
|  | warm start | 29,693 | 9,897 | 1,000 | 10,897 | 16,384 | 3:14 |
|  |  |  |  |  |  | warm start average | 3:42 |

@FieldMouse-AI commented on GitHub (Jul 30, 2025):

🤗 Hello, @rick-github , I noticed a few hours ago that version 0.10.0 was just released.

So, I ran only WARM START runs since I had some time. Cold start runs imply that I rebooted my system -- and sorry, I just didn't have the time for a reboot. 🙇

The first thing that I noticed is that just by eyeballing it, the new version appears to be using less system RAM for the models. I have no good measures as I did not measure this before, but I noticed that the RAM used/available appears lower than before.

Next, the warm start performance appears to be not just better, but much better.

Please note that I did increase num_predict from 1,000 in the previous runs to 4,000. This was because I updated my application to produce longer responses, though the responses always came in under 1,000 tokens.

| ollama version | mode | est. msg len | est. msg tokens | num_predict | desired total tokens | num_ctx | inference time |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0.10.0 | warm start | 16,093 | 5,364 | 4,000 | 9,364 | 16,384 | 1:07 🤯 |
|  | warm start | 16,094 | 5,364 | 4,000 | 9,364 | 16,384 | 0:13 🤯🤯 |
|  | warm start | 16,094 | 5,364 | 4,000 | 9,364 | 16,384 | 0:14 🤯🤯 |
|  | warm start | 20,396 | 6,798 | 4,000 | 10,798 | 16,384 | 1:44 🤯 |

(Sorry, I did not compute an average this time because the timings are pretty far apart, so I felt that an average would not produce a representative value.)

First, 🤯!!!! The speed improvements are not just much better, but unexpectedly astounding!!!

Even the two slow runs are at least twice as fast as my previous fastest runs. Even though I am running with swapoff -a, it is possible that a model got unloaded along the way and needed to be reloaded, so I may be paying a model reload penalty.

🤔 I am starting to think that if I had enough RAM to keep all of my models loaded without unload/reload events, I would be getting consistent sub-20-second inferences with my test.

For the record: My previous tests were run after a reboot with only Chrome and some terminals open.

UPDATE: I found that I also had 2 terminals open with ollama run sessions from other tests, which were taking up as much as 12-14GB!! So memory pressure from the unloading and reloading of models may well be what caused my slower runs. Still, it's only my hunch.
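
A cheap way to check that unload/reload hunch on future runs is the standard CLI status command:

```shell
ollama ps   # lists loaded models, their memory footprint, and when they expire
```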

This time, I happen to be doing other work, so I have a lot of browser windows open along with the GUI-based DBeaver SQL tool. So memory, while available, is still more constrained than in my previous tests.

These are just my armchair, back of the napkin timings, but OMG, I am quite happy. This is fast. 🤯

@rick-github , I am really curious about any comments that you could offer.

Thanks to you and the crew, @rick-github ! 🤗


@FieldMouse-AI commented on GitHub (Aug 4, 2025):

@rick-github , I have to admit that after further testing, things are quite quick now.

If you are fine with this, I will be happy to close this issue.
