[GH-ISSUE #3818] llama3:70b-instruct response time. #2361

Closed
opened 2026-04-12 12:40:46 -05:00 by GiteaMirror · 2 comments

Originally created by @Atlanta11 on GitHub (Apr 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3818

Hello, what else can I do to make the AI respond faster? Currently everything is working, but a bit on the slow side, with an Nvidia GeForce RTX 4090 and an i9-14900K with 64 GB of RAM. I was able to download the model (`ollama run llama3:70b-instruct`) fairly quickly at a speed of 30 MB per second.
Here is my server.log: [server.log](https://github.com/ollama/ollama/files/15062206/server.log)

![Screenshot 2024-04-22 133915](https://github.com/ollama/ollama/assets/9640147/195d26be-7909-49f9-bd5f-277ecb229576)

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

llama3:70b-instruct

GiteaMirror added the bug label 2026-04-12 12:40:46 -05:00

@markhill343 commented on GitHub (Apr 23, 2024):

The 70B model is around ~50 GB. The GPU only works with the ~24 GB it can load into VRAM; the remaining part is stored in system RAM, and only your CPU can process that. So you basically need another 4090.
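To make the comment above concrete, here is a rough back-of-envelope sketch of the weight split. The ~50 GB figure comes from the comment; the ~2 GB overhead allowance for KV cache and CUDA context is an assumption for illustration, not a measured value.

```python
# Back-of-envelope estimate: how much of a ~50 GB model fits
# into a 24 GB GPU, and how much spills over to the CPU.

MODEL_SIZE_GB = 50.0   # approx. llama3:70b-instruct footprint (from the comment)
VRAM_GB = 24.0         # RTX 4090
OVERHEAD_GB = 2.0      # assumed allowance for KV cache, CUDA context, etc.

# Fraction of the weights that fits in VRAM, clamped to [0, 1].
gpu_fraction = max(0.0, min(1.0, (VRAM_GB - OVERHEAD_GB) / MODEL_SIZE_GB))
cpu_fraction = 1.0 - gpu_fraction

print(f"~{gpu_fraction:.0%} of weights on GPU, ~{cpu_fraction:.0%} on CPU")
# → ~44% of weights on GPU, ~56% on CPU
```

With more than half the weights served from system RAM, generation speed is dominated by CPU memory bandwidth rather than the GPU, which matches the slowdown reported in this issue.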

<!-- gh-comment-id:2071693729 -->

@jmorganca commented on GitHub (May 9, 2024):

Yes, quite a bit of the model is running on the CPU in this case, so it's expected to be slow. Sorry!

<!-- gh-comment-id:2103558919 -->
Reference: github-starred/ollama#2361