[GH-ISSUE #3159] A way to communicate reasons for low performance to users of CLI & API #27703

Closed
opened 2026-04-22 05:14:35 -05:00 by GiteaMirror · 1 comment

Originally created by @easp on GitHub (Mar 15, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3159

Originally assigned to: @bmizerany on GitHub.

People are often concerned about what they perceive to be low performance, and/or about whether Ollama is making optimal use of their RAM/VRAM/GPU/CPU cores. This comes up frequently in GitHub issues, the main Discord channel, and the Discord help channel.

I think a lot of these queries could be avoided if Ollama surfaced this information to users in a more concise and obvious way than making them dig through the log file: perhaps a message in the CLI just before displaying the REPL prompt, and a status message field in the API response.

Common conditions seem to be:

  • Model + context too large for VRAM
  • GPU detected but not used, using CPU for inference
  • No GPU detected, using CPU for inference

The message should warn that performance will be low and point the user towards remedies:

  • Using a model with fewer parameters and/or smaller quantization
  • A URL for troubleshooting issues with GPUs.
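
As a rough sketch of what the API side of this could look like, the generate response might gain an optional warnings field, and the CLI could print the same strings once just before the REPL prompt. The struct below is purely illustrative, assuming a hypothetical `Warnings` field; it is not part of Ollama's actual API:

```go
// Hypothetical sketch only: the Warnings field below is illustrative
// and not part of Ollama's real API.
package api

// GenerateResponse mirrors a few fields of the /api/generate reply
// and adds a hypothetical Warnings field for performance notes.
type GenerateResponse struct {
	Model    string `json:"model"`
	Response string `json:"response"`
	Done     bool   `json:"done"`

	// Warnings would carry human-readable notes for the conditions
	// listed above, e.g. "no GPU detected, using CPU for inference",
	// plus a troubleshooting URL.
	Warnings []string `json:"warnings,omitempty"`
}
```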
GiteaMirror added the feature request label 2026-04-22 05:14:36 -05:00

@pdevine commented on GitHub (May 18, 2024):

We just added `ollama ps` in the latest version of Ollama, which will show you what percentage of a model is loaded on the CPU or GPU. For troubleshooting you'll still need to look through the logs, but it should cover the 3 cases you mentioned. I'll go ahead and close out the issue.
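
For illustration, the PROCESSOR column of `ollama ps` reports the CPU/GPU split for each loaded model; the snippet below is approximate sample output, not captured from a real session:

```
$ ollama ps
NAME             ID              SIZE      PROCESSOR          UNTIL
llama3:latest    365c0bd3c000    6.7 GB    41%/59% CPU/GPU    4 minutes from now
```

A model showing anything other than 100% GPU is at least partly running on the CPU, which covers the VRAM-overflow and CPU-fallback cases without reading the log.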
