[GH-ISSUE #12746] Benchmarking Local Ollama Models (Performance & Quality) - created. #8452

Open
opened 2026-04-12 21:08:08 -05:00 by GiteaMirror · 0 comments

Originally created by @cumhur on GitHub (Oct 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12746

Hello Ollama Team and Community,

First, I want to say thank you for the incredible work on Ollama. It has completely changed how many of us experiment with local LLMs.

To help myself and others better evaluate and compare the performance of different models, I've developed a Python-based benchmarking tool. I wanted to share it with the community in case it's useful, or perhaps to inspire future official features.

GitHub Repository: https://github.com/cumhur/Ollama_Local_LLM_Benchmark

This tool allows users to test multiple local models against a list of prompts. It has two key outputs:

  1. Performance Dashboard (.html): It generates a simple HTML dashboard (using Matplotlib) that visually compares key metrics like total response time, load duration, and tokens/second. This makes it easy to see performance differences at a glance.
  2. Detailed Report (.csv): It saves all benchmark data and the full text response from each model into a CSV file. This is crucial because it allows for secondary "quality" analysis (e.g., one could feed these responses to a more powerful model like GPT-4 or Claude 3 to score their accuracy or coherence).
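
For readers who want a feel for how the metrics above can be derived, here is a minimal sketch. It is not the code from the linked repository. It assumes a local Ollama server at the default address and relies on the `total_duration`, `load_duration`, `eval_count`, and `eval_duration` fields that Ollama's `/api/generate` endpoint returns (durations in nanoseconds). The function names `generate`, `summarize`, and `report_csv` are hypothetical and chosen only for illustration.

```python
import csv
import io
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds

FIELDS = ["model", "prompt", "total_s", "load_s", "tokens_per_s", "response"]


def generate(model, prompt, url=OLLAMA_URL):
    """Send one non-streaming request and return the parsed JSON reply."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def summarize(model, prompt, result):
    """Turn one API reply into a flat row of benchmark metrics."""
    eval_ns = result.get("eval_duration", 0)
    tokens = result.get("eval_count", 0)
    return {
        "model": model,
        "prompt": prompt,
        "total_s": result.get("total_duration", 0) / NS_PER_S,
        "load_s": result.get("load_duration", 0) / NS_PER_S,
        # Guard against a zero duration to avoid dividing by zero.
        "tokens_per_s": tokens * NS_PER_S / eval_ns if eval_ns else 0.0,
        "response": result.get("response", ""),
    }


def report_csv(rows):
    """Render the rows, including full response text, as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example usage (requires a running Ollama server and pulled models;
# the model names and prompt are placeholders):
#
# rows = [
#     summarize(m, p, generate(m, p))
#     for m in ["model-a", "model-b"]
#     for p in ["Explain recursion in one sentence."]
# ]
# with open("report.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(report_csv(rows))
```

Keeping the full response text in each row is what makes the later quality pass possible: the same CSV can be handed to a stronger model for scoring without rerunning the benchmark.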

I created this because I needed a consistent way to see how different models and configurations perform on my own hardware.

I believe a robust, integrated benchmarking feature would be a powerful addition to the Ollama ecosystem, helping users quickly find the best model (for speed and quality) for their specific needs.

Thank you for your time and for building such a great platform!

GiteaMirror added the feature request label 2026-04-12 21:08:08 -05:00

Reference: github-starred/ollama#8452