Originally created by @wxletter on GitHub (Jul 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6008
Originally assigned to: @dhiltgen on GitHub.
What is the issue?
When I run "ollama run llama3.1:70b", I can see that 22.9/24 GB of dedicated GPU memory is used, and 18.9/31.9 GB of shared GPU memory is used (it's in Chinese so I did the translation).
From "server.log" I can see "offloaded 42/81 layers to GPU", and when I'm chatting with llama3.1 the response is very slow, "ollama ps" shows:
Memory should be enough to run this model, then why only 42/81 layers are offloaded to GPU, and ollama is still using CPU? Is there a way to force ollama to use GPU? Server log attached, let me know if there's any other info that could be helpful.
OS: Windows 11
GPU: Nvidia RTX 4090
CPU: Intel i7-13700KF
RAM: 64 GB
Ollama version: 0.3.0
server.log (attached)
@wxletter commented on GitHub (Jul 27, 2024):
@mxmp210
Thanks for your reply; however, I don't have that issue running llama3.1:70b. The model loads successfully, but the response is very slow since Ollama is running on the CPU. I have 64 GB RAM.
@wxletter commented on GitHub (Jul 27, 2024):
@dhiltgen could you take a look at this issue? I'm not sure if it's rude to @ you like this; I just saw you're helping people with other problems and think you may be able to help me with this one. Thanks in advance!
@rick-github commented on GitHub (Jul 27, 2024):
ollama is using the GPU: almost all of the dedicated VRAM (21.1 of 24 GB) is being used for the model. But the model is larger than the available dedicated VRAM; it needs 39.3 GB in total, so some of it has to spill into system RAM, which shows up as 18.2 GB of shared GPU memory.
@wxletter commented on GitHub (Jul 27, 2024):
I understand that the dedicated 24 GB of VRAM is not enough to load the model, so shared GPU memory is used. Although "shared GPU memory" is actually RAM, it should be treated as VRAM, just slower than real VRAM. So do you mean that whether it's shared GPU memory or plain RAM, as long as part of the layers are offloaded to RAM, Ollama will use CPU + GPU?
@rick-github commented on GitHub (Jul 27, 2024):
I don't have deep knowledge of Nvidia devices/drivers or how llama.cpp uses them, but generally the problem with RAM-limited peripherals is memory bandwidth. PCIe devices can access system RAM, but at a lower speed than the CPU can. Taking current top-of-the-line tech, a x16 PCIe 4.0 bus has about 32 GB/s simplex transfer rate, while a DDR5-based CPU/RAM system has about 64 GB/s. (GPUs have much higher bandwidth to their local memory due to the increased width of the bus: GDDR6 is typically > 800 GB/s, while HBM is measured in TB/s.) So while it's technically possible for a PCIe-based device to access system RAM, it's usually more efficient to let the CPU process the data in system RAM and the PCIe device process the data in its own RAM.
Somebody with more knowledge of Nvidia cards and llama.cpp could provide more insight.
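As a rough illustration of those bandwidth numbers (a sketch, not from the original thread; the figures are the approximate ones quoted above), here is how long a single full pass over the ~18 GB of spilled layer data would take on each link:

```python
# Rough illustration, using the approximate bandwidths quoted above:
# time to stream the ~18 GB of spilled model layers over each memory link.
BANDWIDTH_GB_S = {
    "PCIe 4.0 x16 (GPU <-> system RAM)": 32,
    "DDR5 (CPU <-> system RAM)": 64,
    "GDDR6 (GPU <-> local VRAM)": 800,
}

SPILLED_GB = 18.2  # the shared GPU memory reported in this issue

for link, bw in BANDWIDTH_GB_S.items():
    print(f"{link}: {SPILLED_GB / bw * 1000:7.1f} ms per full pass")
```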
@wxletter commented on GitHub (Jul 28, 2024):
Thanks very much for the explanation. I don't have this kind of knowledge; I just thought that since it's called "shared GPU memory", if it can only be used the same way as normal RAM, then it's meaningless. It's like "virtual memory": that's just a file on the hard drive, but it's treated as RAM (at least in some cases), not as the hard drive.
If "shared GPU memory" can be recognized as VRAM, even though its speed is lower than real VRAM, Ollama should be able to use 100% GPU to do the job, and then the response should be quicker than using CPU + GPU. I'm not sure if I'm wrong or whether Ollama can do this.
@rick-github commented on GitHub (Jul 28, 2024):
Whether CPU+GPU or GPU-only is faster depends on where the bottleneck is: memory bandwidth or compute. Either way, it's not an ollama issue; it's a llama.cpp issue. Follow up on https://github.com/ggerganov/llama.cpp/issues/6743.
@wxletter commented on GitHub (Jul 28, 2024):
I subscribed to that issue, but it's not the same as mine. On my PC, when Ollama is running llama3.1 70b, both VRAM and shared GPU memory are used; however, most of the time it's the CPU doing the work and the GPU is barely used (judging from the performance monitor). I agree with @alirezanet that even if some layers are offloaded to shared GPU memory, the CPU should not be doing most of the work. In my case it takes more than 1 min before Ollama starts to respond (to a simple chat like just saying "hello"), and I only get about 1 word per second; the performance is too bad.
@rick-github commented on GitHub (Jul 28, 2024):
Stable Diffusion users found shared memory impacted processing speed so much that Nvidia added an option to turn it off. If you have time, it would be interesting to try this and see if anything changes. Having read up a little on shared memory, it's not clear to me why the driver is reporting any shared-memory usage at all: llama.cpp has only 42 layers of the model loaded into VRAM, and if llama.cpp is using the CPU for the other 39 layers, then there should be no shared GPU RAM, just VRAM and system RAM. It would also be interesting if you could post a screen capture of the GPU and CPU usage for the entire time llama.cpp is doing inference, to see whether the load switches completely between GPU and CPU or uses a bit of both at the same time.
@wxletter commented on GitHub (Jul 28, 2024):
I tried setting "prefer no system fallback" for ollama app.exe, ollama.exe, and ollama_llama_server.exe, restarted Ollama, and restarted the PC, with no result... when I run "ollama run llama3.1:70b" it's still using 20+ GB of shared GPU memory. I took a screenshot after I sent "hello" to the model; this is the CPU and GPU usage while llama3.1 is working out how to respond. CPU usage is about 50%, and GPU usage is about 10% all the time; sometimes GPU usage will rise to about 30% and then immediately drop to lower than 10%.
@dhiltgen commented on GitHub (Jul 29, 2024):
As others have pointed out, ollama (and the underlying llama.cpp library) utilize dedicated VRAM on the GPU for inference. Once that memory is near fully allocated, the remaining portions of the model are loaded into system memory and inference is performed using the CPU.
What Windows does with shared memory is run a paging algorithm where pages of memory are swapped back and forth between system RAM and GPU VRAM. While this does allow some apps to overflow VRAM, the performance impact on inference would be significant. It's better to have the CPU perform inference on the portion of the model that doesn't fit within VRAM, in parallel with the GPU processing its portion, instead of thrashing memory pages back and forth.
@wxletter commented on GitHub (Jul 29, 2024):
@dhiltgen Thanks very much for the explanation. I understand now that it's better to have the CPU handle the layers loaded into RAM while the GPU handles the layers loaded into VRAM. I have one concern left: in my case, about half the layers are loaded into RAM and the other half into VRAM. When the GPU and CPU perform inference together, CPU usage is about 40%, and most of the time GPU usage is about 10%; it hardly reaches 30% before dropping back to 10% immediately. Is there anything I can do to make the most of both GPU and CPU and get better inference performance?
@wxletter commented on GitHub (Aug 1, 2024):
Update - today I updated Ollama to version 0.3.2. llama3.1 70B loads faster (about 25 sec) than before (Ollama 0.3.0, more than 1 min), and CPU utilization is higher (about 70%), but GPU utilization is still low (about 20%) during inference. 40/81 layers are loaded into VRAM.
@dhiltgen commented on GitHub (Aug 1, 2024):
It's possible our thread count might not be optimal on your system - see #2496 - you can experiment with setting different values for num_thread to try to optimize performance.
@wxletter commented on GitHub (Aug 2, 2024):
My CPU is an Intel 13700KF with 16 cores and 24 threads. I tried "/set parameter num_thread 24" and "/set parameter num_thread 16", but I only get about 40% CPU usage; I can't even reach the 70% I saw after updating Ollama yesterday, and the GPU usage is still low, about 10% to 20%. I'll try some other numbers to see if it gets better. Is there any parameter I can use to improve GPU utilization?
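For anyone experimenting along these lines: besides the interactive /set command, options like num_thread (CPU threads) and num_gpu (number of layers to offload to the GPU) can also be passed per request through Ollama's REST API. A minimal sketch, assuming a local server on the default port; the values shown are illustrative, not recommendations:

```python
import requests

# Minimal sketch: pass runtime options per request via Ollama's REST API.
# num_thread sets the CPU thread count; num_gpu is the number of layers
# Ollama will try to offload to the GPU.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:70b",
        "prompt": "hello",
        "stream": False,
        "options": {"num_thread": 16, "num_gpu": 42},
    },
)
print(resp.json()["response"])
```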
@xiaohan815 commented on GitHub (Sep 20, 2024):
Maybe you need two 4090 GPUs to run the 70B model without it being sluggish.
@michelle-chou25 commented on GitHub (Oct 29, 2024):
Yes, at least 48 GB of GPU memory.
@kripper commented on GitHub (Nov 16, 2024):
I'm experiencing the same symptom here: the CPU is reporting high load even though only the GPU should be in use (via VRAM and shared memory). Maybe there is a bug in llama.cpp?
@icemagno commented on GitHub (Dec 4, 2024):
I have Ollama for Windows with an RTX 4060, and ollama keeps insisting on using the CPU and RAM instead of my GPU. It is very disappointing because I spent a fortune on this GPU. Many have explained various things about PCIe, buses, RAM performance, etc. So what is the point of having a GPU then?
@rick-github commented on GitHub (Dec 4, 2024):
ollama will use the GPU if it's able to. If you would like your issue debugged, open a new ticket and add server logs.
@DJMo13 commented on GitHub (Jan 24, 2025):
Use smaller models that can actually fit in your GPU's VRAM: quantized 13B models need a GPU with at least 12 GB of VRAM, and 32B models need at least 24 GB. 70B models need datacenter GPUs or two consumer ones. If you put a model that's too big on your GPU, it has to fall back to the CPU as well, and even 1 token per second is fast for a Llama 70B model running on a CPU...
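Those sizing rules follow from a simple back-of-the-envelope formula: model weights take roughly parameter count times bits per weight, plus overhead for the KV cache and runtime buffers. A sketch, assuming ~4.5 bits per weight for q4-class quantization and a 1.2x overhead factor (both assumptions, not measurements):

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
# bits_per_weight ~4.5 approximates q4-class quantization; the 1.2x
# overhead factor for KV cache and buffers is a rough assumption.
def est_vram_gb(params_billions: float, bits_per_weight: float = 4.5,
                overhead: float = 1.2) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

for size in (8, 13, 32, 70):
    print(f"{size:>3}B @ ~4-bit: ~{est_vram_gb(size):.0f} GB")
```

For 70B this lands near the "at least 48 GB" figure quoted earlier in the thread.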
@cyberluke commented on GitHub (Feb 16, 2025):
Just use LM Studio; it loads models better and has many more options for setting up offloading and the number of GPU layers :-/
@99sono commented on GitHub (Apr 21, 2025):
Intuitively, I always thought it would be a no-brainer that the price to pay to swap model data from RAM to GPU VRAM would be much lower than using the CPU for any heavy math like in big language models. With so many more cores on the GPU (like the RTX 3090’s 10,000+ vs. a CPU’s 16 or so), I figured moving data to the GPU would be worth it, even if it’s a bit slow.
Here’s the analysis from Grok 3 on this matter for the Llama 3.1 70B model (q4_0, ~40-50 GB) on an RTX 3090 with 24 GB VRAM, based on your setup (42/81 layers in VRAM, 39 in shared memory):
Summary (TL;DR)
Surprisingly, swapping layers isn't much faster than using the CPU, because moving data over PCIe is so slow. For your RTX 3090, sticking with q4_0 and maybe trying a smaller model like Llama 3.1 8B (~5 GB, fits in VRAM) might be better. You could also try setting OLLAMA_MAX_VRAM to push more layers to VRAM, but be careful of crashes. Thanks for posting this, it's really interesting! What's your CPU, and have you tried smaller models?
Full Grok explanation:
I understand your confusion about the layer-swapping costs and the comparison between swapping layers versus computing on the CPU. The original computation was a bit unclear, and I’ll make it explicit by calculating the time per token for three scenarios involving a 4-bit quantized Llama 3.1 70B model (~40-50 GB) on an RTX 3090 (24 GB VRAM). The scenarios will compare:
I’ll also rewrite your GitHub post in a simpler, non-expert tone, following your requested structure: (A) your intuitive thought about swapping versus CPU computation, and (B) Grok 3’s analysis. The post will reflect the clarified model size (~40-50 GB for q4_0) and address the issue (Ollama Issue #6008).
Clarified Analysis: Time per Token for Three Scenarios
Model and Hardware Assumptions
- Llama 3.1 70B, q4_0 quantization: ~40-50 GB total, ~0.43 GB per layer (81 layers)
- RTX 3090: 24 GB VRAM, ~35.6 TFLOPs/s compute
- x16 PCIe 4.0 link: ~32 GB/s simplex
- CPU: ~1-2 TFLOPs/s
- Current split: 42/81 layers in VRAM, 39 in shared memory/RAM
Scenario 1: Ideal (All Layers in Infinite VRAM)
GPU compute time per token (1-2 TFLOPs of work at ~35.6 TFLOPs/s):
$$\frac{1\text{-}2\,\text{TFLOPs}}{35.6\,\text{TFLOPs/s}} \approx 28\text{-}56\,\text{ms}$$
Scenario 2: Dynamic Swapping (39 Layers Swapped per Token)
Transferring one ~0.43 GB layer over a ~32 GB/s PCIe link:
$$\frac{0.43\,\text{GB}}{32\,\text{GB/s}} \approx 13.4\,\text{ms}$$
Swapping in all 39 layers:
$$39 \times 13.4\,\text{ms} \approx 522.6\,\text{ms}$$
If layers must be swapped both in and out:
$$39 \times 13.4\,\text{ms} \times 2 \approx 1045.2\,\text{ms}$$
For simplicity, assume VRAM can hold the new layers temporarily (e.g., by overwriting the KV cache), so ~522.6 ms for swapping in.
Total per token:
$$522.6\,\text{ms (swapping)} + 50\text{-}100\,\text{ms (compute)} \approx 572.6\text{-}622.6\,\text{ms}$$
Scenario 3: Static Offloading (39 Layers on CPU)
GPU compute for its 42 layers:
$$\frac{0.5\text{-}1\,\text{TFLOPs}}{35.6\,\text{TFLOPs/s}} \approx 14\text{-}28\,\text{ms}$$
CPU compute for the remaining 39 layers:
$$\frac{0.5\text{-}1\,\text{TFLOPs}}{1\text{-}2\,\text{TFLOPs/s}} \approx 250\text{-}1000\,\text{ms}$$
Total per token (using the rounded figures from the summary):
$$25\text{-}50\,\text{ms (GPU)} + 500\text{-}1000\,\text{ms (CPU)} \approx 525\text{-}1050\,\text{ms}$$
Comparison
Key Insight: Dynamic swapping (Scenario 2) is not a bargain compared to static offloading (Scenario 3). Swapping 39 layers costs ~523 ms, which is comparable to or worse than the CPU’s 500-1000 ms for computing 39 layers. The GPU compute advantage (50-100 ms for all layers) is negated by PCIe latency. Static offloading is simpler and often faster, especially if the CPU is reasonably performant.
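For anyone who wants to replay the arithmetic behind the three scenarios, here is a small sketch using the same assumed figures (35.6 TFLOPs/s GPU, 1-2 TFLOPs/s CPU, 32 GB/s PCIe, ~0.43 GB per layer); the totals differ slightly from the rounded numbers quoted above:

```python
# Replay of the three per-token estimates above. All inputs are the
# analysis's assumptions, not measurements.
GPU_TFLOPS = 35.6          # assumed RTX 3090 throughput
CPU_TFLOPS = (1.0, 2.0)    # assumed CPU throughput range
PCIE_GB_S = 32.0           # assumed x16 PCIe 4.0 simplex bandwidth
LAYER_GB = 0.43            # assumed per-layer size (q4_0)

def ms(seconds: float) -> float:
    return seconds * 1000

# Scenario 1: all 81 layers in VRAM, 1-2 TFLOPs of work per token.
s1 = (ms(1.0 / GPU_TFLOPS), ms(2.0 / GPU_TFLOPS))

# Scenario 2: stream 39 layers over PCIe each token, then compute on GPU.
swap_in = ms(39 * LAYER_GB / PCIE_GB_S)
s2 = (swap_in + 50, swap_in + 100)

# Scenario 3: GPU computes its 42 layers, CPU the other 39.
gpu = (ms(0.5 / GPU_TFLOPS), ms(1.0 / GPU_TFLOPS))
cpu = (ms(0.5 / CPU_TFLOPS[1]), ms(1.0 / CPU_TFLOPS[0]))
s3 = (gpu[0] + cpu[0], gpu[1] + cpu[1])

for name, (lo, hi) in [("1: all in VRAM", s1),
                       ("2: dynamic swapping", s2),
                       ("3: static CPU offload", s3)]:
    print(f"Scenario {name}: {lo:.0f}-{hi:.0f} ms per token")
```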