[GH-ISSUE #6008] Ollama is running on both CPU and GPU - expected to use GPU only #3758

Closed
opened 2026-04-12 14:34:31 -05:00 by GiteaMirror · 23 comments

Originally created by @wxletter on GitHub (Jul 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6008

Originally assigned to: @dhiltgen on GitHub.

What is the issue?
When I run "ollama run llama3.1:70b", I can see that 22.9/24 GB of dedicated GPU memory is used, and 18.9/31.9 GB of shared GPU memory is used (it's in Chinese so I did the translation).

(screenshot: Task Manager showing dedicated and shared GPU memory usage)

From "server.log" I can see "offloaded 42/81 layers to GPU", and when I'm chatting with llama3.1 the response is very slow, "ollama ps" shows:

(screenshot: "ollama ps" output)

Memory should be enough to run this model, so why are only 42/81 layers offloaded to the GPU, and why is ollama still using the CPU? Is there a way to force ollama to use the GPU? Server log attached, let me know if there's any other info that could be helpful.

OS
Windows 11

GPU
Nvidia RTX 4090

CPU
Intel i7 13700KF

RAM
64GB

Ollama version
0.3.0
server.log: https://github.com/user-attachments/files/16398489/server.log

GiteaMirror added the question label 2026-04-12 14:34:31 -05:00

@wxletter commented on GitHub (Jul 27, 2024):

@mxmp210
Thanks for your reply, however I don't have this issue running llama3.1:70b; the model loads successfully, but the response is very slow since Ollama is running on the CPU. I have 64 GB RAM.


@wxletter commented on GitHub (Jul 27, 2024):

@dhiltgen could you help take a look at this issue? I'm not sure if it's rude to @ you like this; I just saw you're helping people with other problems and thought you might be able to help me with this. Thanks in advance!


@rick-github commented on GitHub (Jul 27, 2024):

time=2024-07-27T14:30:02.936+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=42 layers.split="" memory.available="[21.3 GiB]" memory.required.full="39.3 GiB" memory.required.partial="21.1 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[21.1 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"

ollama is using the GPU: almost all of the dedicated VRAM (21.1 of 24G) is being used for the model. But the model is larger than the available dedicated VRAM; it needs 39.3G of RAM in total, so some of it has to spill into system RAM, which shows up as the 18.2G of shared GPU memory.
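For reference, the key=value fields in that log line can be pulled apart with a few lines of Python to make the memory figures easier to read; this is only a minimal sketch, assuming the server.log format shown above:

```python
# Minimal sketch (not part of Ollama): parse the key=value fields out of the
# "offload to cuda" line from server.log.
import re

log_line = (
    'msg="offload to cuda" layers.model=81 layers.offload=42 '
    'memory.available="[21.3 GiB]" memory.required.full="39.3 GiB" '
    'memory.required.partial="21.1 GiB"'
)

# Matches key=value and key="quoted value" pairs.
pairs = dict(re.findall(r'([\w.]+)=("[^"]*"|\S+)', log_line))

for key in ("layers.model", "layers.offload", "memory.available",
            "memory.required.full", "memory.required.partial"):
    value = pairs[key].strip('"')
    print(f"{key:26} {value}")
```

Here memory.required.full (39.3 GiB) exceeds memory.available (21.3 GiB), which is exactly the condition that triggers the partial offload.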


@wxletter commented on GitHub (Jul 27, 2024):

> time=2024-07-27T14:30:02.936+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=42 layers.split="" memory.available="[21.3 GiB]" memory.required.full="39.3 GiB" memory.required.partial="21.1 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[21.1 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
>
> ollama is using the GPU: almost all of the dedicated VRAM (21.1 of 24G) is being used for the model. But the model is larger than the available dedicated VRAM; it needs 39.3G of RAM in total, so some of it has to spill into system RAM, which shows up as the 18.2G of shared GPU memory.

I understand that the dedicated 24 GB of VRAM is not enough to load the model, so shared GPU memory is used. Although the "shared GPU memory" is actually RAM, it should be treated as "VRAM", just with lower speed than real VRAM. So do you mean that whether it's shared GPU memory or plain RAM, as long as part of the layers are offloaded to RAM, Ollama will use CPU + GPU?


@rick-github commented on GitHub (Jul 27, 2024):

I don't have deep knowledge of Nvidia devices/drivers or how llama.cpp uses them, but generally the problem with RAM-limited peripherals is memory bandwidth. PCIe devices can access system RAM, but at a lower speed than the CPU can. Taking current top-of-the-line tech, an x16 PCIe4 bus has about 32GBps simplex data transfer rate, while a DDR5-based CPU/RAM system has about 64GBps. (GPUs have higher bandwidth to their local memory due to the wider bus: GDDR6 is typically > 800GBps, while HBM is measured in TBps.) So while it's technically possible to have a PCIe device access system RAM, it's usually more efficient to let the CPU process the data in system RAM and the PCIe device process the data in its own RAM.

Somebody with more knowledge of Nvidia cards and llama.cpp could provide more insight.
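To put rough numbers on that point, here is a back-of-envelope comparison of how long a single pass over the ~18.9 GB of spilled weights would take on each link, using the approximate bandwidth figures from the comment above. Real inference overlaps transfers with compute, so treat these as orders of magnitude only:

```python
# Back-of-envelope only: time for one full pass over the spilled weights on each link.
spilled_gb = 18.9
links_gb_per_s = {
    "PCIe 4.0 x16 (simplex)": 32,
    "DDR5 system RAM": 64,
    "GDDR6 on-card VRAM": 800,
}
for name, bandwidth in links_gb_per_s.items():
    print(f"{name:24} ~{spilled_gb / bandwidth * 1000:5.0f} ms per pass")
```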


@wxletter commented on GitHub (Jul 28, 2024):

> I don't have deep knowledge of Nvidia devices/drivers or how llama.cpp uses them, but generally the problem with RAM-limited peripherals is memory bandwidth. PCIe devices can access system RAM, but at a lower speed than the CPU can. Taking current top-of-the-line tech, an x16 PCIe4 bus has about 32GBps simplex data transfer rate, while a DDR5-based CPU/RAM system has about 64GBps. (GPUs have higher bandwidth to their local memory due to the wider bus: GDDR6 is typically > 800GBps, while HBM is measured in TBps.) So while it's technically possible to have a PCIe device access system RAM, it's usually more efficient to let the CPU process the data in system RAM and the PCIe device process the data in its own RAM.
>
> Somebody with more knowledge of Nvidia cards and llama.cpp could provide more insight.

Thanks very much for the explanation, I don't have this kind of knowledge. I just thought that since it's called "shared GPU memory", if it can only be used in the same way as normal RAM, then it's meaningless; like how "virtual memory" is just a file on the hard drive, but it's treated like RAM (at least in some cases), not like the hard drive.

If "shared GPU memory" can be recognized as VRAM, even if its speed is lower than real VRAM, Ollama should use 100% GPU to do the job, and then the response should be quicker than using CPU + GPU. I'm not sure if I'm wrong or whether Ollama can do this.


@rick-github commented on GitHub (Jul 28, 2024):

Whether CPU+GPU or GPU only is faster depends on where the bottleneck is: memory bandwidth or compute. Either way, it's not an ollama issue, it's a llama.cpp issue. Follow up on https://github.com/ggerganov/llama.cpp/issues/6743.


@wxletter commented on GitHub (Jul 28, 2024):

> Whether CPU+GPU or GPU only is faster depends on where the bottleneck is: memory bandwidth or compute. Either way, it's not an ollama issue, it's a llama.cpp issue. Follow up on ggerganov/llama.cpp#6743.

I subscribed to that issue, but it's not the same as mine. On my PC, when Ollama is running llama3.1 70b, both VRAM and shared GPU memory are used, yet most of the time the CPU is doing the work and the GPU is barely used (judging from the performance monitor). I agree with @alirezanet that even if some layers are offloaded to shared GPU memory, the CPU should not be doing most of the work. In my case it takes more than 1 min before Ollama starts to respond (to a simple chat like just saying "hello"), and I only get about 1 word per second; the performance is too bad.


@rick-github commented on GitHub (Jul 28, 2024):

Stable Diffusion users found shared memory impacted processing speed so much that Nvidia added an option to turn it off (https://nvidia.custhelp.com/app/answers/detail/a_id/5490). If you have time, it would be interesting if you could try this and see if anything changes. Having read up a little on shared memory, it's not clear to me why the driver is reporting any shared memory usage at all: llama.cpp has only loaded 42 layers of the model into VRAM, and if llama.cpp is using the CPU for the other 39 layers, then there should be no shared GPU RAM in use, just VRAM and system RAM. It would be interesting if you could post a screen capture of the GPU and CPU usage for the entire time that llama.cpp is doing inference, to see whether the load switches completely between GPU and CPU or uses a bit of both at the same time.


@wxletter commented on GitHub (Jul 28, 2024):

I tried setting "prefer no system fallback" for ollama app.exe, ollama.exe and ollama_llama_server.exe, restarted Ollama, restarted the PC, with no result... when I run "ollama run llama3.1:70b" it's still using 20+ GB of shared GPU memory. I took a screenshot after I sent "hello" to the model; this is the CPU and GPU usage while llama3.1 is working out its response. CPU usage is about 50%, and GPU usage is about 10% all the time; sometimes GPU usage rises to about 30%, then immediately drops below 10%.

(screenshot: CPU and GPU usage during inference)


@dhiltgen commented on GitHub (Jul 29, 2024):

As others have pointed out, ollama (and the underlying llama.cpp library) utilizes dedicated VRAM on the GPU for inference. Once that memory is near fully allocated, the remaining portions of the model are loaded into system memory and inference is performed using the CPU.

What Windows does with the shared memory is perform a paging algorithm where pages of memory are swapped back and forth between system RAM and GPU VRAM, and while this does allow some apps to overflow VRAM, the performance impact to inference would be significant. It's better to leverage the CPU for performing inference of the portion of the model that doesn't fit within VRAM in parallel with the GPU processing its portion of the model, instead of thrashing memory pages back and forth.
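The 42/81 layer split in this thread follows from the memory figures in the server.log line quoted earlier. A rough sketch of that arithmetic (assumptions only, this is not Ollama's actual scheduler logic):

```python
# Rough sketch, not Ollama's actual scheduler: reserve room for the KV cache,
# non-repeating weights and the compute graph, then see how many of the
# repeating layers fit in the reported 21.3 GiB of available VRAM.
available_gib     = 21.3          # memory.available
kv_cache_gib      = 640 / 1024    # memory.required.kv
nonrepeating_gib  = 822 / 1024    # memory.weights.nonrepeating
graph_partial_gib = 1.1           # memory.graph.partial
repeating_gib     = 35.7          # memory.weights.repeating, spread over 80 layers

per_layer_gib = repeating_gib / 80
usable_gib = available_gib - kv_cache_gib - nonrepeating_gib - graph_partial_gib
layers_that_fit = int(usable_gib / per_layer_gib)

print(f"~{per_layer_gib:.2f} GiB per layer, room for roughly {layers_that_fit} layers")
```

With these numbers the estimate comes out at roughly 42 layers, which matches the "offloaded 42/81 layers to GPU" message in the log.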


@wxletter commented on GitHub (Jul 29, 2024):

> As others have pointed out, ollama (and the underlying llama.cpp library) utilizes dedicated VRAM on the GPU for inference. Once that memory is near fully allocated, the remaining portions of the model are loaded into system memory and inference is performed using the CPU.
>
> What Windows does with the shared memory is perform a paging algorithm where pages of memory are swapped back and forth between system RAM and GPU VRAM, and while this does allow some apps to overflow VRAM, the performance impact to inference would be significant. It's better to leverage the CPU for performing inference of the portion of the model that doesn't fit within VRAM in parallel with the GPU processing its portion of the model, instead of thrashing memory pages back and forth.

@dhiltgen Thanks very much for the explanation. I understand now that it's better to have the CPU handle the layers loaded into RAM while the GPU handles the layers loaded into VRAM. I have one concern left: in my case about half the layers are loaded into RAM and the other half into VRAM, and when the GPU and CPU perform inference together, the CPU usage is about 40% while the GPU usage is mostly around 10%, barely reaching 30% before dropping back immediately. Is there anything I can do to make better use of both the GPU and CPU and get better inference performance?


@wxletter commented on GitHub (Aug 1, 2024):

Update - today I updated Ollama to version 0.3.2. llama3.1 70B loads faster (about 25 sec) than before (Ollama 0.3.0, more than 1 min), and CPU utilization is higher (about 70%), but GPU utilization is still low (about 20%) during inference. 40/81 layers are loaded into VRAM.

(screenshot: Task Manager CPU/GPU usage on Ollama 0.3.2)


@dhiltgen commented on GitHub (Aug 1, 2024):

It's possible our thread count might not be optimal on your system - see #2496 - you can experiment with setting different values for num_thread to try to optimize performance.
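One way to experiment with num_thread is to pass it per request through the Ollama REST API (it can also be set interactively with "/set parameter num_thread", as tried below). A minimal sketch, where the value 16 is only a starting point to tune:

```python
# Minimal sketch: pass num_thread per request via the Ollama REST API
# (POST /api/generate) instead of "/set parameter" in the CLI.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:70b",
        "prompt": "hello",
        "stream": False,
        "options": {"num_thread": 16},  # try values around the physical core count
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```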


@wxletter commented on GitHub (Aug 2, 2024):

My CPU is an Intel 13700KF; it has 16 cores and 24 threads. I tried "/set parameter num_thread 24" and "/set parameter num_thread 16", but I only get about 40% CPU usage, not even the ~70% I saw after updating Ollama yesterday, and the GPU usage is still low, about 10% to 20%. I'll try some other numbers to see if it gets better. Is there any parameter I can use to improve GPU utilization?


@xiaohan815 commented on GitHub (Sep 20, 2024):

Maybe you need two 4090 GPUs to run the 70B model without it lagging?


@michelle-chou25 commented on GitHub (Oct 29, 2024):

> Maybe you need two 4090 GPUs to run the 70B model without it lagging?

Yes, at least 48 GB of GPU memory.


@kripper commented on GitHub (Nov 16, 2024):

I'm experiencing the same symptom here (https://github.com/ollama/ollama/issues/7673#issuecomment-2480393630): the CPU is reporting high load, but only the GPU should be in use (using VRAM and shared memory). Maybe there is a bug in llama.cpp?


@icemagno commented on GitHub (Dec 4, 2024):

I have Ollama for Windows with an RTX 4060, and ollama keeps insisting on using the CPU and RAM instead of my GPU. It is very disappointing because I spent a fortune buying this GPU. Many have explained various things about PCIe, buses, RAM performance, etc... so what is the point of having a GPU then?


@rick-github commented on GitHub (Dec 4, 2024):

ollama will use the GPU if it's able to. If you would like your issue debugged, open a new ticket and add server logs.


@DJMo13 commented on GitHub (Jan 24, 2025):

Use smaller models that actually fit in your GPU's VRAM: quantized 13B models need a GPU with at least 12 GB of VRAM, and 32B models need at least 24 GB. 70B models need datacenter GPUs or two consumer cards. If you load a model that is too big for your GPU, it has to fall back to the CPU as well, and even 1 token per second is fast for a Llama 70B model running on a CPU...
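Those sizes follow roughly from bytes-per-parameter arithmetic. A hedged sketch of the rule of thumb (the constants here are my own approximations, and the KV cache grows with context length, so real requirements can be higher):

```python
# Rough rule of thumb, approximations only: a ~4-bit quantized model needs roughly
# 0.5-0.6 bytes per parameter for weights, plus headroom for KV cache and buffers.
def approx_vram_gb(params_billion, bytes_per_param=0.56, overhead_gb=3.0):
    return params_billion * bytes_per_param + overhead_gb

for size_b in (8, 13, 32, 70):
    print(f"{size_b:>3}B model at ~4-bit: ~{approx_vram_gb(size_b):.0f} GB")
```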


@cyberluke commented on GitHub (Feb 16, 2025):

Just use LM Studio; it loads models better and has many more options for setting up offloading and the number of layers :-/


@99sono commented on GitHub (Apr 21, 2025):

Intuitively, I always thought it would be a no-brainer that the price to pay to swap model data from RAM to GPU VRAM would be much lower than using the CPU for any heavy math like in big language models. With so many more cores on the GPU (like the RTX 3090’s 10,000+ vs. a CPU’s 16 or so), I figured moving data to the GPU would be worth it, even if it’s a bit slow.

Here’s the analysis from Grok 3 on this matter for the Llama 3.1 70B model (q4_0, ~40-50 GB) on an RTX 3090 with 24 GB VRAM, based on your setup (42/81 layers in VRAM, 39 in shared memory):


Summary (TL;DR)

  • Ideal Case (Imaginary Infinite VRAM): If all 81 layers fit in VRAM, the GPU computes everything in ~50-100 ms per token, giving ~10-20 tokens/s. This isn’t possible with 24 GB VRAM, as the model needs ~40-50 GB.
  • Swapping Layers: If we swap 39 layers from RAM to VRAM per token (each layer is ~0.43 GB), copying takes ~523 ms over PCIe (~13 ms per layer). GPU compute for all 81 layers is ~50-100 ms, so total time is ~573-623 ms per token. This is slow due to data transfer.
  • Current Setup (Static Offloading): Your setup has 42 layers in VRAM (computed by GPU in ~25-50 ms) and 39 layers in RAM (computed by CPU in ~500-1000 ms). Total time is ~525-1050 ms per token, which explains the slow responses.

Surprisingly, swapping layers isn’t much faster than using the CPU, because moving data over PCIe is so slow. For your RTX 3090, sticking with q4_0 and maybe trying a smaller model like Llama 3.1 8B (~5 GB, fits in VRAM) might be better. You could also try setting OLLAMA_MAX_VRAM to push more layers to VRAM, but be careful of crashes. Thanks for posting this—it’s really interesting! What’s your CPU, and have you tried smaller models?


Full Grok explanation:

I understand your confusion about the layer-swapping costs and the comparison between swapping layers versus computing on the CPU. The original computation was a bit unclear, and I’ll make it explicit by calculating the time per token for three scenarios involving a 4-bit quantized Llama 3.1 70B model (~40-50 GB) on an RTX 3090 (24 GB VRAM). The scenarios will compare:

  1. Ideal Scenario: All 81 layers, KV cache, and activations fit in an imaginary RTX 3090 with infinite VRAM, so all computation is on the GPU.
  2. Dynamic Swapping Scenario: 42 layers stay in VRAM, 39 layers are swapped in from RAM per token, and all computation is on the GPU after swapping.
  3. Static Offloading Scenario: 42 layers are computed on the GPU, 39 layers are computed on the CPU (as in the GitHub issue), with no swapping.

I’ll also rewrite your GitHub post in a simpler, non-expert tone, following your requested structure: (A) your intuitive thought about swapping versus CPU computation, and (B) Grok 3’s analysis. The post will reflect the clarified model size (~40-50 GB for q4_0) and address the issue (Ollama Issue #6008).


Clarified Analysis: Time per Token for Three Scenarios

Model and Hardware Assumptions

  • Model: Llama 3.1 70B, q4_0 quantization (~40-50 GB total memory footprint, ~35 GB for weights).
    • 81 layers, each ~0.43 GB (\( \frac{35 \, \text{GB}}{81} \approx 0.43 \, \text{GB} \)).
    • KV cache (~5-10 GB for 2048 tokens) and activations (~5 GB) included in the 40-50 GB total.
  • Hardware: NVIDIA RTX 3090 with 24 GB VRAM, PCIe 4.0 x16 (~32 GB/s per direction), 35.6 TFLOPS (FP16, used for q4_0 inference).
    • CPU: Assumed high-end (e.g., AMD Ryzen 9 5950X, 16 cores, ~1-2 TFLOPS in FP16, ~100 GB/s DDR4 bandwidth).
  • Memory Usage (from Issue): 22.9 GB VRAM (42 layers) + 18.9 GB shared GPU memory (39 layers), totaling ~41.8 GB, consistent with q4_0.

Scenario 1: Ideal (All Layers in Infinite VRAM)

  • Setup: All 81 layers, KV cache, and activations fit in an imaginary RTX 3090 with infinite VRAM. All computation is on the GPU.
  • Compute Time:
    • Each layer involves matrix multiplications (e.g., feed-forward, attention). For a 70B model, a single layer’s computation is ~10-20 GFLOPs (based on typical LLM workloads).
    • Total for 81 layers: ~1-2 TFLOPs per token (assuming sequential token generation).
    • RTX 3090’s 35.6 TFLOPS can process this in:
      \[ \frac{1-2 \, \text{TFLOPs}}{35.6 \, \text{TFLOPs/s}} \approx 28-56 \, \text{ms} \]
    • VRAM bandwidth (936 GB/s) is not a bottleneck, as each layer’s data (~0.43 GB) is accessed in ~0.46 ms.
  • Total Time per Token: ~50-100 ms (including minor overheads like kernel launches, assuming ~10-20 tokens/s for a 70B model on an RTX 3090).

Scenario 2: Dynamic Swapping (39 Layers Swapped per Token)

  • Setup: 42 layers stay in VRAM (as in the issue). For each token, 39 layers are swapped from RAM to VRAM, and all 81 layers are computed on the GPU.
  • Swapping Cost:
    • Each layer is ~0.43 GB. Copying one layer over PCIe 4.0 (~32 GB/s) takes:
      \[ \frac{0.43 \, \text{GB}}{32 \, \text{GB/s}} \approx 13.4 \, \text{ms} \]
    • Swapping 39 layers:
      \[ 39 \times 13.4 \, \text{ms} \approx 522.6 \, \text{ms} \]
    • Note: This assumes swapping in only (unidirectional). If we need to swap out 39 layers to make space, double the cost:
      \[ 39 \times 13.4 \, \text{ms} \times 2 \approx 1045.2 \, \text{ms} \]
      For simplicity, assume VRAM can hold the new layers temporarily (e.g., by overwriting KV cache), so ~522.6 ms for swapping in.
  • Compute Time:
    • After swapping, all 81 layers are computed on the GPU, identical to Scenario 1: ~50-100 ms.
  • Total Time per Token:
    \[ 522.6 \, \text{ms (swapping)} + 50-100 \, \text{ms (compute)} \approx 572.6-622.6 \, \text{ms} \]
    • Total: ~573-623 ms per token. This is much slower than Scenario 1 due to PCIe latency.

Scenario 3: Static Offloading (39 Layers on CPU)

  • Setup: 42 layers in VRAM (GPU), 39 layers in shared GPU memory (RAM, computed by CPU), as in the issue.
  • GPU Compute:
    • 42 layers require ~0.5-1 TFLOPs (half of Scenario 1).
    • RTX 3090 processes this in:
      \[ \frac{0.5-1 \, \text{TFLOPs}}{35.6 \, \text{TFLOPs/s}} \approx 14-28 \, \text{ms} \]
    • Total GPU time: ~25-50 ms (with overhead).
  • CPU Compute:
    • 39 layers also require ~0.5-1 TFLOPs.
    • CPU (1-2 TFLOPS) processes this in:
      \[ \frac{0.5-1 \, \text{TFLOPs}}{1-2 \, \text{TFLOPs/s}} \approx 250-1000 \, \text{ms} \]
    • RAM bandwidth (~100 GB/s) is sufficient but slower than VRAM, adding minor overhead.
    • Shared GPU memory access (via PCIe) may add latency (~10-20 ms per layer access), but assume direct CPU computation for simplicity.
    • Total CPU time: ~500-1000 ms (conservative, based on typical CPU-bound LLM inference).
  • Total Time per Token:
    • GPU and CPU compute sequentially (layer-by-layer processing in Transformers).
      \[ 25-50 \, \text{ms (GPU)} + 500-1000 \, \text{ms (CPU)} \approx 525-1050 \, \text{ms} \]
    • Total: ~525-1050 ms per token. This aligns with the slow responses in the issue.

Comparison

  • Scenario 1 (Ideal): ~50-100 ms per token (GPU only, infeasible with 24 GB VRAM).
  • Scenario 2 (Dynamic Swapping): ~573-623 ms per token (GPU compute + swapping 39 layers).
  • Scenario 3 (Static Offloading): ~525-1050 ms per token (GPU for 42 layers, CPU for 39 layers).

Key Insight: Dynamic swapping (Scenario 2) is not a bargain compared to static offloading (Scenario 3). Swapping 39 layers costs ~523 ms, which is comparable to or worse than the CPU’s 500-1000 ms for computing 39 layers. The GPU compute advantage (50-100 ms for all layers) is negated by PCIe latency. Static offloading is simpler and often faster, especially if the CPU is reasonably performant.
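For anyone who wants to replay the arithmetic, a compact sketch that reproduces the three per-token estimates above from the same assumed figures (per-layer size, PCIe bandwidth, GPU and CPU compute time); these are back-of-envelope estimates, not measurements:

```python
# Back-of-envelope reproduction of the three scenarios, using the assumed figures
# from the analysis above. Estimates only, not measurements.
layers_total, layers_gpu = 81, 42
layers_cpu = layers_total - layers_gpu
layer_gb   = 0.43          # assumed weight size per layer
pcie_gb_s  = 32            # PCIe 4.0 x16, one direction
gpu_ms_all = (50, 100)     # assumed GPU compute time for all 81 layers
cpu_ms     = (500, 1000)   # assumed CPU compute time for the 39 spilled layers

swap_ms = layers_cpu * layer_gb / pcie_gb_s * 1000
gpu_ms_part = [t * layers_gpu / layers_total for t in gpu_ms_all]

print(f"1) all layers in VRAM (hypothetical): {gpu_ms_all[0]}-{gpu_ms_all[1]} ms/token")
print(f"2) swap 39 layers every token:        {swap_ms + gpu_ms_all[0]:.0f}-{swap_ms + gpu_ms_all[1]:.0f} ms/token")
print(f"3) static CPU+GPU offload:            {gpu_ms_part[0] + cpu_ms[0]:.0f}-{gpu_ms_part[1] + cpu_ms[1]:.0f} ms/token")
```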


Reference: github-starred/ollama#3758