[GH-ISSUE #499] Dedicated hardware for 13b/70b models #229

Closed
opened 2026-04-12 09:44:59 -05:00 by GiteaMirror · 3 comments

Originally created by @zdeneksvarc on GitHub (Sep 8, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/499

Hey guys, let's say I want to get a dedicated home server that would run `ollama serve` with 13b/70b models in Docker. Is there any chance of hardware (CPU only) that achieves at least 5 tok/s, since Ollama doesn't use GPU acceleration?
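For context, the CPU-only Docker workflow being asked about looks roughly like this (a sketch assuming the official `ollama/ollama` image and the `llama2:13b` tag; exact flags and tags may differ by version):

```sh
# Start the Ollama server in a CPU-only container.
# -v persists downloaded model weights; -p exposes the Ollama HTTP API.
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Pull and chat with a 13B model inside the running container.
# --verbose prints timing stats (prompt eval rate and eval rate in tokens/s),
# which is the easiest way to check whether the 5 tok/s target is met.
docker exec -it ollama ollama run llama2:13b --verbose
```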


@mxyng commented on GitHub (Sep 8, 2023):

We're actively working on GPU acceleration on Linux and Windows. You can follow the progress in #454, which includes GPU-accelerated Docker images.
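For anyone following up after GPU support shipped: the GPU-accelerated container is started roughly like this (a sketch assuming an NVIDIA GPU with the NVIDIA Container Toolkit installed on the host; check the current Ollama Docker docs for the exact invocation):

```sh
# Same setup as the CPU-only container, but with all host GPUs passed through.
# Requires the NVIDIA Container Toolkit so Docker can expose the GPU devices.
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```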


@zdeneksvarc commented on GitHub (Sep 8, 2023):

What great news. This Ollama thing is awesome. Thank you.


@zdeneksvarc commented on GitHub (Sep 9, 2023):

Although the answer to the OP was encouraging, I was still looking for hard data on what hardware is needed to get at least 5 tok/s. For completeness, here is a partial answer from the link below:

> The M1/M2 Pro supports up to 200 GB/s unified memory bandwidth, while the M1/M2 Max supports up to 400 GB/s. For example, a MacBook M2 Max using llama.cpp can run a 7B model at 38 t/s, a 13B model at 22 t/s, and a [65B model at 5 t/s](https://twitter.com/natfriedman/status/1665408927431884800). However, in terms of inference speed, a dual setup of RTX 3090/4090 GPUs is faster than the Mac M2 Pro/Max/Ultra. Two RTX 4090s [can run 65b models at a speed of 20 tokens per second](https://github.com/turboderp/exllama#dual-gpu-results), while two affordable secondhand RTX 3090s achieve 15 tokens per second with ExLlama. Additionally, the Mac evaluates prompts more slowly, making the dual-GPU setup more appealing.

And here is an awesome, comprehensive answer: https://www.hardware-corner.net/guides/computer-to-run-llama-ai-model/
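Those bandwidth figures are the key variable: token generation is typically memory-bandwidth bound, so a rough ceiling can be estimated as bandwidth divided by the bytes each token has to stream. A back-of-the-envelope sketch (the quantization sizes and bandwidth numbers below are illustrative assumptions, not measurements from this thread):

```sh
# Rough ceiling: each generated token streams every weight once, so
# tok/s is approximately memory_bandwidth (GB/s) / model_size (GB).
# Assuming ~4-bit quantization (~0.55 GB per billion parameters):
#   70B q4 ~ 39 GB: M2 Max ~400 GB/s           -> ~10 tok/s ceiling (~5 t/s observed above)
#   70B q4 ~ 39 GB: dual-channel DDR5 ~80 GB/s -> ~2 tok/s ceiling
#   13B q4 ~  7 GB: dual-channel DDR5 ~80 GB/s -> ~11 tok/s ceiling
awk 'BEGIN { bw = 80; size = 39; printf "%.1f tok/s upper bound\n", bw / size }'
```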

Reference: github-starred/ollama#229