[GH-ISSUE #5822] Slow inference on dual A40 #50139

Closed
opened 2026-04-28 14:21:05 -05:00 by GiteaMirror · 1 comment

Originally created by @jmorganca on GitHub (Jul 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5822

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Slow performance on an A40 card with `llama3`

server.log: https://github.com/user-attachments/files/16322714/server.log

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the needs more info label 2026-04-28 14:21:05 -05:00

@dhiltgen commented on GitHub (Jul 22, 2024):

Based on the server log, this looks like a local build from source with CUDA v12, rather than one of the official builds.

Can we get more information about how this was built?


Reference: github-starred/ollama#50139