[GH-ISSUE #14771] cmd: add ollama fit to recommend compatible models based on available hardware #9547

Open
opened 2026-04-12 22:28:07 -05:00 by GiteaMirror · 0 comments

Originally created by @khalilkhamassi62-oss on GitHub (Mar 11, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14771

Problem

A new Ollama user faces a blank prompt with no guidance on which model
to run. Picking the wrong one leads to:

  • Out-of-memory crashes when VRAM is insufficient
  • Multi-minute load times from unexpected CPU offloading
  • No way to know in advance whether a 70B model will run at all

There is currently no way to ask Ollama "what can my machine actually run?"

Proposed Solution

A new ollama fit subcommand — and matching GET /api/fit endpoint —
that scans the machine and ranks a built-in model catalogue by
hardware compatibility.

CLI example:

$ ollama fit

Ollama Fit Check
──────────────────────────────────────────────────────────────
  CPU  : linux (amd64)
  RAM  : 22.4 GB free / 31.9 GB total
  GPU  : CUDA NVIDIA RTX 3080  •  9.2 GB free / 10.0 GB total
  Disk : 180.0 GB free  →  /home/user/.ollama/models
──────────────────────────────────────────────────────────────

  ✅  IDEAL — Full GPU inference, fast
  ────────────────────────────────────────────────────────────
  llama3.2:3b          Q4_K_M    2.0 GB    ~82 tok/s  GPU
  phi3:3.8b            Q4_K_M    2.3 GB    ~80 tok/s  GPU
  mistral:7b           Q4_K_M    4.5 GB    ~55 tok/s  GPU
  llama3.1:8b          Q4_K_M    4.9 GB    ~51 tok/s  GPU

  🟡  GOOD — Minor CPU offload
  ────────────────────────────────────────────────────────────
  gemma2:9b            Q4_K_M    5.5 GB    ~38 tok/s  GPU+CPU

API example:

GET /api/fit
GET /api/fit?tags=code
GET /api/fit?family=qwen&all=true
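
The response could marshal from Go types along these lines — a sketch
only; every field name here is an assumption, with EstTPS included
because the open questions below discuss it:

// Hypothetical response types for GET /api/fit. Field names are
// illustrative, not the fork's actual definitions.
type FitResponse struct {
	Hardware HardwareProfile `json:"hardware"` // profile type sketched under Implementation
	Models   []ScoredModel   `json:"models"`
}

type ScoredModel struct {
	Name      string  `json:"name"`     // e.g. "llama3.2:3b"
	Quant     string  `json:"quant"`    // e.g. "Q4_K_M"
	SizeGB    float64 `json:"size_gb"`
	Score     float64 `json:"score"`    // composite 0.0–1.0, see Scoring Model
	Tier      string  `json:"tier"`     // Ideal | Good | Marginal | Possible | Too Large
	RunMode   string  `json:"run_mode"` // GPU | GPU+CPU | CPU
	EstTPS    float64 `json:"est_tps"`  // estimated, not measured (open question 3)
	Installed bool    `json:"installed"`
}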

Startup TUI: A "Fit Check" entry in the ollama menu opens a
tabbed screen. Users browse tier tabs with ←/→, select models with
space, and press Enter to pull them — without leaving the terminal.
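
The key handling for that screen could be a conventional bubbletea
model along these lines (a sketch; every name here is hypothetical):

import tea "github.com/charmbracelet/bubbletea"

// fitScreen is a self-contained bubbletea model; rendering and the
// actual pull are stubbed out.
type fitScreen struct {
	tabs     []string        // one tab per tier
	active   int             // focused tab index
	cursor   string          // model name under the cursor
	selected map[string]bool // models marked with space
}

func (m fitScreen) Init() tea.Cmd { return nil }
func (m fitScreen) View() string  { return "" } // rendering elided

func (m fitScreen) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
	if key, ok := msg.(tea.KeyMsg); ok {
		switch key.String() {
		case "left":
			if m.active > 0 {
				m.active--
			}
		case "right":
			if m.active < len(m.tabs)-1 {
				m.active++
			}
		case " ": // toggle selection under the cursor
			m.selected[m.cursor] = !m.selected[m.cursor]
		case "enter": // pull the selected models
			return m, pullSelected(m.selected)
		}
	}
	return m, nil
}

// pullSelected is a stub standing in for the command that would run
// the equivalent of `ollama pull` for each selected model.
func pullSelected(sel map[string]bool) tea.Cmd { return nil }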

Why This Belongs in Ollama Core

No new hardware detection. The implementation delegates entirely
to discover.GPUDevices() and ml.SystemInfo — the same paths the
scheduler already uses. Disk space is read with syscall.Statfs — a
single system call.
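
On the Unix side that likely amounts to little more than this (a sketch
using Linux field types; disk_unix.go in the fork may differ):

import "syscall"

// diskStats returns free and total bytes at the models directory,
// in a single Statfs call.
func diskStats(path string) (freeBytes, totalBytes uint64, err error) {
	var st syscall.Statfs_t
	if err = syscall.Statfs(path, &st); err != nil {
		return 0, 0, err
	}
	// Bavail: blocks available to unprivileged users; Bsize: block size.
	return st.Bavail * uint64(st.Bsize), st.Blocks * uint64(st.Bsize), nil
}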

No new dependencies. Only packages already in go.mod are used.

Follows existing patterns exactly:

  • Handler is a *Server method in server/routes.go, same as
    ListHandler, ShowHandler, etc.
  • Client method in api/client.go follows the same pattern as
    client.List().
  • CLI uses the same Cobra + tabwriter pattern as ollama list.
  • TUI screen is a self-contained bubbletea model injected into the
    existing state machine — zero changes to the core render loop.
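
Mirroring client.List(), the client method reduces to a few lines
(a sketch; assumes the same unexported do helper the existing api
methods use, and the FitResponse type sketched above):

// Fit asks the server to rank the catalogue against local hardware.
func (c *Client) Fit(ctx context.Context) (*FitResponse, error) {
	var fr FitResponse
	if err := c.do(ctx, http.MethodGet, "/api/fit", nil, &fr); err != nil {
		return nil, err
	}
	return &fr, nil
}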

Works offline. The catalogue is static data compiled into the
binary. No network calls.

Installed models detected correctly. Uses manifest.Manifests()
— the same path as ollama list — to mark already-downloaded models.

Implementation

Working implementation on my fork:
https://github.com/khalilkhamassi62-oss/ollama/commit/773609a7

New package fitcheck/:

  • hardware.go — collects GPU, RAM, disk into HardwareProfile
  • requirements.go — 165-entry catalogue across 72 model families
    (Llama, Mistral, Phi, Gemma, Qwen, DeepSeek, Granite, vision,
    embedding, reasoning models)
  • scorer.go — 4-component scoring: VRAM fit (40%), RAM headroom
    (25%), disk space (15%), GPU generation speed class (20%)
  • disk_unix.go / disk_windows.go — platform disk stats
  • scorer_test.go — 10 tests, no real hardware required
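
The two core types presumably look something like this — field names
are guesses from the descriptions above, not the fork's definitions:

type HardwareProfile struct {
	OS, Arch   string
	RAMFreeGB  float64
	RAMTotalGB float64
	GPUs       []GPUInfo // via discover.GPUDevices()
	DiskFreeGB float64   // at the models directory, via diskStats
}

type GPUInfo struct {
	Name        string
	Library     string // "cuda", "rocm", "metal"
	VRAMFreeGB  float64
	VRAMTotalGB float64
	ComputeGen  string // e.g. CUDA SM level; drives the speed class
}

type ModelRequirement struct {
	Name   string   // "llama3.1:8b"
	Quant  string   // "Q4_K_M"
	SizeGB float64  // disk footprint
	VRAMGB float64  // for full GPU inference
	RAMGB  float64  // for CPU or offloaded inference
	Family string   // "llama", "qwen", ...
	Tags   []string // "code", "embed", ...
}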

Scoring Model

VRAM score:  model fits entirely on GPU → 1.0 (RunMode: GPU)
             partial fit, RAM can offload → 0.25–0.65 (GPU+CPU)
             no GPU or can't fit → 0.0 (CPU)

RAM score:   available ≥ required → 1.0
             total ≥ 85% of required → 0.5 + warning note
             insufficient → 0.0

Disk score:  available ≥ model size → 1.0
             insufficient → 0.0

Speed score: Metal (Apple Silicon 36GB+) → 1.0 / ~120 tok/s
             CUDA SM9+ (H100, RTX 40xx)  → 1.0 / ~150 tok/s
             CUDA SM8  (A100, RTX 30xx)  → 0.85 / ~100 tok/s
             CUDA SM7  (V100, RTX 20xx)  → 0.65 / ~60 tok/s
             ROCm                         → 0.70 / ~70 tok/s
             CPU only                     → 0.15 / ~3 tok/s

Final = VRAM×0.40 + RAM×0.25 + Disk×0.15 + Speed×0.20

Tier:  ≥0.82 → Ideal  |  ≥0.62 → Good  |  ≥0.38 → Marginal
       ≥0.15 → Possible  |  <0.15 or RAM+Disk both 0 → Too Large
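
Transcribed into code, the composite and the cutoffs are small enough
to read at a glance (a sketch of scorer.go's likely shape):

// finalScore combines the four component scores with the stated weights.
func finalScore(vram, ram, disk, speed float64) float64 {
	return vram*0.40 + ram*0.25 + disk*0.15 + speed*0.20
}

// tier maps a composite score to a display tier; a model that fits in
// neither RAM nor disk is Too Large regardless of its composite.
func tier(score, ramScore, diskScore float64) string {
	switch {
	case ramScore == 0 && diskScore == 0:
		return "Too Large"
	case score >= 0.82:
		return "Ideal"
	case score >= 0.62:
		return "Good"
	case score >= 0.38:
		return "Marginal"
	case score >= 0.15:
		return "Possible"
	default:
		return "Too Large"
	}
}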

Alternatives Considered

  Alternative                             Why not
  ──────────────────────────────────────  ──────────────────────────────────────────────────
  Recommend models on ollama.com          Leaves the terminal, ignores current free VRAM/RAM
  Add requirements to ollama show         Only useful after you already know which model
  Separate installable tool               Installation friction, splits the UX
  Dynamic fetch from ollama.com/library   Network dependency, latency at startup

Open Questions for Maintainers

  1. Should the TUI entry be gated behind OLLAMA_EXPERIMENT=fitcheck
     initially?
  2. Is GET /api/fit the right path, or would /api/hardware returning
     just the hardware profile (separate from scoring) be more composable?
  3. Should EstTPS be removed from the JSON response since it is
     estimated, not measured?

Tests

# No hardware required
go test ./fitcheck/...

# Requires `ollama serve`
ollama fit
ollama fit --tags code --json | jq '.models[0]'
curl http://localhost:11434/api/fit?tags=embed | jq '.models[].req.name'
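
A representative scorer test needs only a synthetic profile, no real
hardware — illustrative, written against the sketched finalScore/tier
above rather than the fork's actual API:

import "testing"

func TestTierThresholds(t *testing.T) {
	// RAM and disk scores of zero force Too Large regardless of composite.
	if got := tier(0.90, 0, 0); got != "Too Large" {
		t.Fatalf("tier = %q, want Too Large", got)
	}
	// Full GPU fit, ample RAM/disk, SM8-class speed: 0.97 → Ideal.
	if s := finalScore(1.0, 1.0, 1.0, 0.85); tier(s, 1.0, 1.0) != "Ideal" {
		t.Fatalf("score %.2f should land in the Ideal tier", s)
	}
}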

Tested on: Ubuntu 24.04, no GPU, 32 GB RAM

GiteaMirror added the feature request label 2026-04-12 22:28:07 -05:00
Reference: github-starred/ollama#9547