[GH-ISSUE #14674] Lack of granular control over model quantization and memory management for large models #71559

Closed
opened 2026-05-05 02:07:51 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @guicybercode on GitHub (Mar 6, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14674

Description

When running large models (72B+), Ollama automatically applies quantization without providing users with granular control over quantization levels, bit-depth, or memory optimization strategies. This leads to:

  1. No visibility into memory allocation: Users cannot see how much VRAM/RAM will be needed before pulling a model
  2. Silent OOM failures: Models fail to load without clear error messages about memory constraints
  3. No quantization presets: Cannot easily switch between different quantization strategies (Q4_K_M, Q5_K_M, IQ3_XXS, etc.) for the same model without manual intervention
  4. Inefficient resource usage: Quantization choices are not optimized based on actual hardware capabilities (VRAM, system RAM, CPU)

This forces users to:

  • Resort to trial and error with models that may not fit their hardware
  • Manually download GGUF files from Hugging Face and convert them
  • Use community-maintained .modelfile workarounds

Steps to reproduce

  1. On a machine with 8GB VRAM (e.g., RTX 4060), run: ollama pull mistral:latest
  2. Run: ollama run mistral
  3. Observe either silent failure or slow performance due to disk swapping
  4. Check logs to find only minimal information about memory usage (the commands below show what the current CLI exposes)
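
For reference, this is roughly the visibility the existing CLI gives today, and only after the pull and load have already happened (recent Ollama builds; exact output fields vary by version):

```
ollama pull mistral:latest
ollama list            # on-disk size of pulled models
ollama show mistral    # architecture, parameter count, context length, quantization (recent versions)
ollama run mistral "hello"
ollama ps              # size of the loaded model and how it is split across GPU/CPU
```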

Expected behavior

  • Pre-flight checks: Before pulling, show estimated VRAM/RAM requirements
  • Quantization selector: ollama pull mistral:q4_k_m or similar, to explicitly choose the quantization level
  • Clear error messages: When OOM occurs, display: "Model requires 15GB VRAM, but only 8GB available. Use quantization Q4_K_M (est. 6GB)"
  • Memory profiling: ollama info <model> should show the actual memory footprint on the current hardware

Actual behavior

  • No warning before pulling 70GB+ models on machines that can't run them
  • Vague error logs: failed to load model without actionable advice
  • Users must manually manage quantization outside Ollama
  • Community relies on scattered .modelfile documentation

System information

  • OS: Linux/macOS/Windows (affects all)
  • Ollama version: 0.1.x - latest
  • Models tested: mistral, llama2:70b, neural-chat
  • GPU/Hardware: Varies (RTX 4060 8GB, M3 Pro 18GB, A100 40GB)

Logs and errors

```
Error: failed to load model
Context: model size 47GB, available memory 8GB
```

(Users report issues across GitHub, Discord, and Reddit with no consistent solution path)

Screenshots (optional)

N/A

Additional context

This is a UX and discoverability problem that affects newcomers most. Advanced users work around it by:

  • Using the Ollama API directly with custom Modelfiles
  • Pre-downloading quantized GGUF files and importing them with a Modelfile (see the sketch after this list)
  • Checking community Hugging Face quantization tables manually
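
For completeness, the GGUF workaround looks roughly like this; the file and model names below are placeholders, but FROM with a local GGUF path is documented Modelfile behavior:

```
# Placeholder file name: any GGUF downloaded from Hugging Face at the desired quant level
cat > Modelfile <<'EOF'
FROM ./mistral-7b-instruct-v0.2.Q4_K_M.gguf
EOF
ollama create mistral-q4km -f Modelfile
ollama run mistral-q4km
```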

Related issues/discussions: Scattered across #1234, #2456, Discord threads, Reddit r/ollama

Proposed solution:

  1. Add --quantization flag to ollama pull
  2. Implement an ollama info --memory-estimate <model> command (sketched below)
  3. Improve error messages with actionable remediation steps
  4. Document quantization trade-offs in CLI help
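
A rough sketch of what items 1-3 could look like on the command line; neither the flag nor the output exists today, and the numbers are purely illustrative:

```
# Hypothetical UX only; none of this is implemented yet
ollama info --memory-estimate llama2:70b
#   weights (q4_K_M): ~40 GB   kv cache @ 4k context: ~1.3 GB   available VRAM: 8 GB
#   => will not fit; consider a smaller model or a lower-bit quantization
ollama pull mistral --quantization q4_K_M
```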

Suggested labels

  • [x] enhancement
  • [x] documentation
  • [ ] bug
  • [ ] question
GiteaMirror added the question label 2026-05-05 02:07:51 -05:00
Author
Owner

@rick-github commented on GitHub (Mar 6, 2026):

Quantizations are listed on the model card. Go to View all and select the quant you would like to use. The size of the model is also listed, so choose one that fits the available VRAM, including the context - more context, more VRAM.
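
In practice, then, the workflow is: open the model's Tags page ("View all"), pick a quant whose listed size plus headroom for the KV cache at your context length fits in VRAM, and pull that tag explicitly. A sketch with example tags (the exact tag names come from the Tags page and vary per model):

```
ollama pull mistral:7b-instruct-q4_K_M   # example tag; listed size is roughly 4.4 GB, fits an 8 GB card
ollama run mistral:7b-instruct-q4_K_M
# In the interactive session, a smaller context window also reduces VRAM use:
# >>> /set parameter num_ctx 2048
```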

Reference: github-starred/ollama#71559