[GH-ISSUE #1727] ollama doesn't use system RAM #63019

Closed
opened 2026-05-03 11:16:42 -05:00 by GiteaMirror · 29 comments

Originally created by @DrGood01 on GitHub (Dec 27, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1727

Originally assigned to: @dhiltgen on GitHub.

I'm running Ollama on an Ubuntu 22 Linux laptop with 32 GB of RAM and an NVIDIA GTX 1650. Ollama loads models exclusively into the graphics card's VRAM and doesn't use any of the system RAM at all. Very frustrating, as it exits with "Error: llama runner exited, you may not have enough available memory to run this model" as soon as I try to chat...

GiteaMirror added the nvidia label 2026-05-03 11:16:42 -05:00

@iplayfast commented on GitHub (Dec 27, 2023):

I ran into this as well. The way to get around it is to tell Ollama you have no GPU; then it will load into system memory.
My mixtralcpu model is as follows:

```
FROM mixtral:latest
TEMPLATE """ [INST] {{ .System }} {{ .Prompt }} [/INST]"""
PARAMETER num_gpu 0
PARAMETER num_ctx 32768
PARAMETER stop "</s>"
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
PARAMETER temperature .9
SYSTEM "You are an intelligent AI that is always helpful"
```

Modify it for the model you are trying to run, and create the new model:

ollama create mixtralcpu -f Modelfile
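
For anyone who'd rather not maintain a separate Modelfile, the same override can also be passed per request. A minimal sketch, assuming a local server on the default port and using mixtral only as a placeholder model name; `num_gpu 0` asks the runner to offload zero layers to the GPU:

```
# Hedged sketch: force CPU-only inference for a single request via the REST API.
# Model name and prompt are placeholders; adjust to whatever you have pulled.
curl http://localhost:11434/api/generate -d '{
  "model": "mixtral:latest",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 0 }
}'
```

Inside an interactive `ollama run` session, `/set parameter num_gpu 0` should have the same effect before you send a prompt.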


@PollastreGH commented on GitHub (Dec 27, 2023):

Can confirm that I'm running into this issue as well: EndeavourOS Linux desktop with 64 GB of RAM and an RTX 3080.

Update: For me this seems to only be happening on 13b models. All 7b models I've tried and a 70b model (dolphin-mixtral) do not have this issue. Strange. Additionally, this didn't happen for me when I was on WSL2, but it does now that I'm on native Linux.


@DrGood01 commented on GitHub (Dec 30, 2023):

iplayfast, thank you so much! I'm now running mixtralcpu on my laptop! It's loading into RAM, which is nice. But it also fills the swap space. Is there a way to tell it not to fill swap? Thanks again.
Edit: I'm now wondering if there's a way to tell the model that it should also use the compute capacity of the graphics card?


@easp commented on GitHub (Jan 2, 2024):

> But it also fills the swap space. Is there a way to tell it not to fill swap?

If you don't have enough RAM, your system will use swap. The solution is to either get more RAM and/or reduce the RAM demands of your computer by closing files, quitting apps, or using smaller models.
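
A quick way to check whether it's genuine memory pressure or just eager swapping is to watch RAM and swap together while the model is answering; a rough sketch with standard Linux tools (the sample interval is arbitrary):

```
# Snapshot of memory and swap state
free -h           # compare the "available" column against the swap row
swapon --show     # which swap devices exist and how full they are

# Ongoing swap activity: non-zero si/so columns mean active swap-in/swap-out
vmstat 5 3
```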


@DrGood01 commented on GitHub (Jan 3, 2024):

Thanks easp. I've got 32 GB of RAM, and while working, my mixtralcpu uses only 7 or 8 GB of it while rapidly filling swap. Any idea?


@Nantris commented on GitHub (Jan 4, 2024):

It seems like, for me, Ollama never uses system memory at all, which doesn't make any sense to me. It reads from disk at 140 MB/s nonstop while it generates, though, and takes up to 15 minutes for a brief response, so maybe it really isn't using system memory.

No GPU involvement.

Specifically, I'm running via WSL1 (which I know is not officially supported, but it's the only option I have).

I hope there might be a Windows version soon! LLMs are just too heavy to boot up in traditional VMs.


@gbrohammer commented on GitHub (Jan 9, 2024):

Same problem: an Ubuntu laptop with 64 GB RAM and an RTX 3050 Ti (4 GB VRAM) fails to load the llama2 model.


@bsu3338 commented on GitHub (Jan 19, 2024):

I am having the same problem. I am using the Docker image. The solution from @iplayfast did not work for me. I tried q5_K_M models of mixtral, mistral, and llama2. I am also running within a VM.


@bsu3338 commented on GitHub (Jan 28, 2024):

My problem was caused by the Hyper-V VM running with Dynamic Memory. After disabling that option, everything worked as designed. I don't know where it belongs, but it would be good to make a note of this somewhere in the documentation.


@pdevine commented on GitHub (Mar 11, 2024):

This should be working better now, in that Ollama should offload a portion of the model to the GPU and a portion to the CPU. Can you test again with Ollama version 0.1.28?

There is also a change coming in 0.1.29 where you will be able to set the amount of VRAM that you want to use, which should force it to use system memory instead.


@Nantris commented on GitHub (Mar 19, 2024):

It works here on Windows now that WSL is no longer involved.


@mzpqnxow commented on GitHub (Apr 11, 2024):

> > But it also fills the swap space. Is there a way to tell it not to fill swap?
>
> If you don't have enough RAM, your system will use swap. The solution is to either get more RAM and/or reduce the RAM demands of your computer by closing files, quitting apps, or using smaller models.

Important note on this, specifically for most Linux distributions; arguably the most important thing for Linux desktop users with more than 16 GB of RAM.

Most popular Linux distributions (all Debian-based distros, at least) advise the kernel to use swap for an unreasonably large portion of memory allocations, **even when there's still plenty of physical RAM available**.

This is a really nasty default setting that, in my opinion, should be adjusted, determined dynamically, or set by asking the user at installation time. For most workloads, a system with 32 GB of RAM should never proactively swap.

You can tell the kernel not to swap so aggressively by setting the swappiness value lower. It's a scale of 0-100; Debian sets it to 60 by default. I reduce it to 1 (effectively, "hardly ever swap"):

$ sudo sysctl -w vm.swappiness=1

As usual, the [Arch docs](https://wiki.archlinux.org/title/Swap#Swappiness) are the best on the subject. There is a [link](https://chrisdown.name/2018/01/02/in-defence-of-swap.html) there with counterpoints that you may want to consider over my suggestion. I prefer to reduce disk I/O; YMMV.
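
Worth noting that `sysctl -w` only lasts until reboot; to keep a lower swappiness across reboots, you can drop the setting into a sysctl config file. A small sketch; the file name below is just a common convention, not a requirement:

```
# Persist the lower swappiness setting across reboots
echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl --system    # reload all sysctl configuration files
```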


@mzpqnxow commented on GitHub (Apr 11, 2024):

> Thanks easp. I've got 32 GB of RAM, and while working, my mixtralcpu uses only 7 or 8 GB of it while rapidly filling swap. Any idea?

Check swappiness (see my previous comment; I should have replied to your comment directly, sorry).


@ConfoundedHermit commented on GitHub (Apr 12, 2024):

I have this same issue on Windows native v0.1.31 with any model. It loads models into GPU VRAM, but larger models obviously run like molasses once VRAM is maxed out. System RAM is not touched by the model at all, and there is 100+ GB free.


@dhiltgen commented on GitHub (Apr 12, 2024):

I've lost track of what this issue is tracking. It sounds like the initial problem was that we miscalculated the number of layers of the model to load on the GPU, ran out of VRAM, and crashed. In general, Ollama is going to try to use the GPU and VRAM before system memory. We've been improving our prediction algorithms to get closer to fully utilizing the GPU's VRAM without exceeding it, so I'd definitely encourage you to try the latest release.


@Nantris commented on GitHub (Apr 12, 2024):

I am pretty confident that when I tested, it was properly using system RAM, as I don't have nearly enough VRAM to store an entire model, and yet response times were pretty reasonable (a few seconds, as opposed to a few minutes previously).

Thanks for the great work.


@Louden7 commented on GitHub (Apr 14, 2024):

I still think this is an issue with Linux, or more specifically Ubuntu.

Troubleshooting steps taken:

  1. updated Ollama
  2. Removed all other LLMs from the local server
  3. Restarted service
  4. Set the default swappiness to 5 (from 60) as suggested above in this thread.

I am running Ollama 0.1.31 locally on Ubuntu 22.04.4 LTS with 16 GB RAM, a 12 GB RTX 3080 Ti, and an old Ryzen 1800X. Any LLM smaller than 12 GB runs flawlessly since it all fits in the GPU's memory. However, when I tried testing with the 19 GB codellama:34b, it loads ~10 GB on the GPU but then puts nothing in the available 16 GB of RAM, resulting in extremely slow response times.

Screenshots below:

  1. tmux split screen of htop (top half) and nvtop (bottom half). Note: average GPU utilization was ~7%.
     ![htop-nvtop](https://github.com/ollama/ollama/assets/22922778/5c39d777-431b-4ac8-9e78-f2876f4b09a4)

  2. Open WebUI with details on the prompt response.
     ![OpenwebUI](https://github.com/ollama/ollama/assets/22922778/c41d044c-5b4a-442f-9c3d-bd8800e53629)
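
For anyone reproducing this, one way to watch GPU VRAM and system RAM side by side while the model generates; a rough sketch assuming the NVIDIA driver tools are installed (nvtop/htop show the same thing interactively):

```
# Refresh every second: GPU memory use from nvidia-smi, system memory from free
watch -n 1 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader; echo; free -h'
```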


@joshwkearney commented on GitHub (Apr 24, 2024):

I'll second this; I'm having the same problem running on Zorin 17.1 (based on Ubuntu). Hardware is a Ryzen 3800X, a 1080 Ti with 12 GB, and 32 GB of RAM. If I run models much larger than 8b, they can't all fit into VRAM, but it doesn't use my system memory at all. I tried the troubleshooting above, but no luck.


@siakc commented on GitHub (Apr 30, 2024):

Is [this](https://github.com/ollama/ollama/issues/3837) related?


@Louden7 commented on GitHub (Apr 30, 2024):

Yes, both of these issues seem related. Trying to run a larger model that does not fully fit in GPU VRAM _should_ store the remainder in system RAM, but as the images I shared above show, it does not.

Ideally (and I may be wrong), in this case it would fill up GPU VRAM, then system RAM, and share the compute load on both GPU and CPU, favoring the GPU for performance.


@Nantris commented on GitHub (Apr 30, 2024):

Is it fair to think this only affects Linux at this point? I haven't re-tested Windows as it's kind of a pain and I haven't much use for it, but it worked for me last time I tried.


@easp commented on GitHub (May 1, 2024):

@Louden7

> Ideally (and I may be wrong), in this case it would fill up GPU VRAM, then system RAM, and share the compute load on both GPU and CPU, favoring the GPU for performance.

The portion in VRAM is computed on the GPU, the portion in system RAM is computed by the CPU. The bottleneck is memory bandwidth, not compute. Transferring data from system RAM to the GPU is slower than transferring it to the CPU.

Model weights are memory mapped. They are accounted for in buffer/file cache, which is generally counted as available memory. Performance with 19GB of model weights is bad because the portion that doesn't fit in VRAM is processed by the CPU, which is much slower than the GPU. Your GPU utilization is low because it's spending most of its time waiting for the CPU.
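
That also explains why htop's per-process numbers can look misleadingly small: the mapped weights are charged to the page cache rather than to the runner's resident set. A rough way to see it (the process name pattern and the default model path are assumptions and may differ between Ollama versions):

```
# "buff/cache" grows as the weights are mapped in, even though "used" barely moves
free -h

# The mapped model blobs show up in the runner's address space rather than in its RSS;
# depending on the version, the process you want may be a child such as ollama_llama_server
grep -i 'ollama/models' "/proc/$(pgrep -f ollama | head -n1)/maps" | head
```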


@Louden7 commented on GitHub (May 2, 2024):

@easp
That makes sense. Thank you for the detailed explanation!

I am still curious about htop not showing the correct system RAM utilization.


@siakc commented on GitHub (May 3, 2024):

If you like, I can run some more commands for you to see what is going on.


@pdevine commented on GitHub (May 16, 2024):

I'm going to go ahead and close this. Models should work with hybrid CPU/GPU. If you want to see what portion is offloaded, you can now use the new `ollama ps` command.
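
For anyone landing here later, the command looks roughly like this (the output row is illustrative, not a real capture):

```
# Shows loaded models and how much of each sits on the CPU vs. the GPU
ollama ps
# NAME             ID              SIZE     PROCESSOR          UNTIL
# llama3:latest    365c0bd3c000    6.7 GB   25%/75% CPU/GPU    4 minutes from now
```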


@kwikiel commented on GitHub (Sep 8, 2024):

The issue seems to be that some people would expect Ollama to load models into RAM first, keep them there as long as possible, and then, when a request comes in, load them from RAM into VRAM.

I have 128 GB RAM and 72 GB VRAM (3x3090), so I could keep the models in RAM instead of loading them from disk each time they're dropped from the GPU.

This seems like a somewhat non-standard use case, and maybe it can be handled by using a RAM disk for storing models, so it could be addressed without changing anything in the Ollama code.
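
If anyone wants to try the RAM-disk route, a rough sketch; the mount point and size are placeholders, and it assumes your install honors the OLLAMA_MODELS environment variable and keeps its models in the default ~/.ollama/models directory (everything in the tmpfs is lost on reboot):

```
# Create a tmpfs-backed RAM disk and copy the existing model store into it
sudo mkdir -p /mnt/ollama-ram
sudo mount -t tmpfs -o size=80G tmpfs /mnt/ollama-ram
cp -r ~/.ollama/models/. /mnt/ollama-ram/

# Point the server at the RAM-backed store for this session
OLLAMA_MODELS=/mnt/ollama-ram ollama serve
```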


@summersonnn commented on GitHub (Oct 21, 2024):

Hey.
I have 12 GB VRAM and 64 GB RAM.
When I run a 53 GB model, I observe that my VRAM is almost full, but my RAM and swap do not change. So where is the model loaded?

    qwen2.5:72b    424bad2cc13f    53 GB    78%/22% CPU/GPU

It feels like my model is partially loaded onto the GPU and processed by the CPU (otherwise, why would I see a spike in CPU usage?). My GPU usage seldom exceeds 30% and is probably around 10% on average. What's going on?

Shouldn't I be seeing something like 11 GB of VRAM usage (as now) plus 42 GB of RAM usage?


@easp commented on GitHub (Oct 21, 2024):

@summersonnn https://github.com/ollama/ollama/issues/1727#issuecomment-2087971975


@cyberluke commented on GitHub (Jan 15, 2025):

> The issue seems to be that some people would expect Ollama to load models into RAM first, keep them there as long as possible, and then, when a request comes in, load them from RAM into VRAM.
>
> I have 128 GB RAM and 72 GB VRAM (3x3090), so I could keep the models in RAM instead of loading them from disk each time they're dropped from the GPU.
>
> This seems like a somewhat non-standard use case, and maybe it can be handled by using a RAM disk for storing models, so it could be addressed without changing anything in the Ollama code.

Yes, I am expecting exactly this behavior! It seems Ollama is not that efficient, but OpenVINO can do it.

Reference: github-starred/ollama#63019