[GH-ISSUE #9419] support deepseek 671b fp4 #6143

Open
opened 2026-04-12 17:29:37 -05:00 by GiteaMirror · 5 comments

Originally created by @liudonghua123 on GitHub (Feb 28, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9419

Hi, I noticed that NVIDIA released DeepSeek 671B FP4 (https://huggingface.co/nvidia/DeepSeek-R1-FP4). Maybe it's better than the Q4 version using the same RAM/VRAM.

Are there any plans to support this kind of quantized version?

See also https://x.com/NVIDIAAIDev/status/1894172956726890623.
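
For context on the "same RAM/VRAM" point: once block scales are counted, both NVFP4 and GGUF's Q4_0 come out to roughly 4.5 bits per weight, so the footprint should be nearly identical and any advantage would have to come from accuracy rather than memory. A back-of-envelope sketch follows; the block sizes and scale widths are assumed from the published NVFP4 (16-value blocks, FP8 scale) and GGUF Q4_0 (32-value blocks, FP16 scale) layouts, not from anything in this issue.

```python
# Back-of-envelope VRAM estimate for DeepSeek-R1 671B at ~4-bit precision.
# Assumptions: NVFP4 = 16 x 4-bit E2M1 values + 1-byte FP8 scale per block;
# GGUF Q4_0 = 32 x 4-bit values + 2-byte FP16 scale per block.

PARAMS = 671e9  # total parameter count of DeepSeek-R1

def bits_per_weight(block_size: int, value_bits: int, scale_bytes: int) -> float:
    """Effective storage cost per weight for a block-quantized format."""
    return (block_size * value_bits + scale_bytes * 8) / block_size

formats = {
    "NVFP4 (E2M1 + FP8 scale per 16)": bits_per_weight(16, 4, 1),
    "GGUF Q4_0 (int4 + FP16 scale per 32)": bits_per_weight(32, 4, 2),
}

for name, bpw in formats.items():
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name}: {bpw:.2f} bits/weight -> ~{gib:.0f} GiB of weights")
```

Both work out to 4.5 bits/weight, about 350 GiB of weights either way (before KV cache and activations).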

GiteaMirror added the model label 2026-04-12 17:29:37 -05:00

@flywiththetide commented on GitHub (Mar 4, 2025):

Current Status of FP4 Support in Ollama

Ollama currently supports Q4 quantization, which is different from FP4 (Floating Point 4-bit). FP4 is an emerging 4-bit floating-point format (non-uniform code points with per-block scaling) that aims for better accuracy than integer-based 4-bit quantization at a similar memory footprint.
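
To make the distinction concrete, here is a minimal sketch comparing the two 4-bit grids: E2M1 (the FP4 element format used by NVFP4/MXFP4, with magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}) versus a uniform int4 grid, each with a simple per-block absmax scale. It ignores the FP8 rounding that NVFP4 applies to its block scales, so treat it as an illustration rather than a faithful reimplementation.

```python
import numpy as np

# Compare round-trip error of a non-uniform FP4 (E2M1) grid vs a uniform
# int4 grid on one block of weights. Per-block scaling is simplified to a
# single absmax scale; real NVFP4 also rounds the scale to FP8.

E2M1 = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])
FP4_GRID = np.concatenate([-E2M1[::-1], E2M1])       # 16 code points
INT4_GRID = np.arange(-8, 8, dtype=np.float64)       # 16 code points

def quantize_block(x: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Scale the block so its absmax maps onto the grid, snap each value to
    the nearest code point, then rescale back (round-trip)."""
    scale = np.abs(x).max() / np.abs(grid).max()
    codes = grid[np.abs(x[:, None] / scale - grid[None, :]).argmin(axis=1)]
    return codes * scale

rng = np.random.default_rng(0)
block = rng.normal(size=16)                           # one 16-weight block
for name, grid in [("FP4 (E2M1)", FP4_GRID), ("int4", INT4_GRID)]:
    err = np.mean((block - quantize_block(block, grid)) ** 2)
    print(f"{name}: round-trip MSE = {err:.5f}")
```

Running it over a few random blocks shows the trade-off: the non-uniform FP4 code points are finer near zero and coarser near the block maximum than the uniform int4 steps.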

Challenges of Supporting FP4 Models in Ollama

  1. Backend Compatibility

    • Ollama’s inference engine is built around integer quantized models (Q4, Q8, etc.).
    • FP4 requires floating point arithmetic optimizations, which may not yet be supported in the current backend.
  2. Performance & Hardware Dependencies

    • FP4 inference is optimized for NVIDIA TensorRT and CUDA.
    • It might not be efficiently supported on non-NVIDIA hardware, limiting its use cases.

Potential Paths for Future Support

  • Check DeepSeek’s Hugging Face Format

    • If DeepSeek 671B FP4 is available in a compatible ONNX/TensorRT format, it may be possible to convert it for Ollama.
  • Modify the Ollama Backend to Support FP4

    • If there’s strong demand, Ollama’s inference engine could be updated to support mixed FP4/FP8 computation.
  • Community Testing & Benchmarking

    • It would be useful to test how FP4 compares to Q4 in accuracy vs. performance before full integration (a rough throughput-measurement sketch follows this list).
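
As a starting point for that kind of comparison, the throughput side can be measured without any backend changes by querying a Q4 model through Ollama's /api/generate endpoint and an FP4 deployment through an OpenAI-compatible completions endpoint (e.g. what trtllm-serve or vLLM expose). The URLs and model names below are placeholders, not anything referenced in this issue.

```python
import time
import requests

# Rough tokens/sec comparison between a Q4 model served by Ollama and an FP4
# model served behind an OpenAI-compatible endpoint. Model names and URLs are
# placeholders; adjust to whatever you actually have running.

PROMPT = "Explain the difference between FP4 and Q4 quantization in one paragraph."

def bench_ollama(model: str, url: str = "http://localhost:11434") -> float:
    r = requests.post(f"{url}/api/generate",
                      json={"model": model, "prompt": PROMPT, "stream": False},
                      timeout=600)
    r.raise_for_status()
    data = r.json()
    # Ollama's generate response reports eval_count and eval_duration (ns)
    return data["eval_count"] / (data["eval_duration"] / 1e9)

def bench_openai_compat(model: str, url: str) -> float:
    start = time.time()
    r = requests.post(f"{url}/v1/completions",
                      json={"model": model, "prompt": PROMPT, "max_tokens": 256},
                      timeout=600)
    r.raise_for_status()
    tokens = r.json()["usage"]["completion_tokens"]
    return tokens / (time.time() - start)

print(f"Q4 via Ollama:   {bench_ollama('deepseek-r1:671b'):.1f} tok/s")
print(f"FP4 via TRT-LLM: {bench_openai_compat('DeepSeek-R1-FP4', 'http://localhost:8000'):.1f} tok/s")
```

The accuracy side (perplexity or task scores) would need a proper evaluation harness on top of both endpoints.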

Would you be interested in running FP4 models via ONNX/TensorRT with Ollama as an experiment?


@ghostplant commented on GitHub (Apr 3, 2025):

@liudonghua123 Before Ollama supports FP4, you can try DeepSeek-FP4 based on Tutel:

https://github.com/microsoft/Tutel?tab=readme-ov-file#whats-new


@ghostplant commented on GitHub (Apr 4, 2025):

@flywiththetide NVIDIA provides some accuracy results comparing FP8 and FP4:

Image: FP8 vs FP4 accuracy comparison (https://github.com/user-attachments/assets/3578293a-19f6-4201-b4a1-5e0e76c1004e)


@Johnreidsilver commented on GitHub (Feb 6, 2026):

Is NVFP4 (Blackwell and newer GPUs) still unsupported in Ollama?

I was thinking of trying out the Qwen3-Coder-Next NVFP4 version...
https://huggingface.co/vincentzed-hf/Qwen3-Coder-Next-NVFP4


@cdsama commented on GitHub (Mar 18, 2026):

Before Ollama can support NVFP4, llama.cpp may need to support NVFP4 first.
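
One way to see why: GGUF records a quantization type per tensor, and the current type list in llama.cpp has no NVFP4 entry, so an NVFP4 checkpoint cannot even be expressed in the file format Ollama loads today. A quick inspection sketch using the gguf Python package from the llama.cpp repo (attribute names as I recall them from gguf-py, so treat them as an assumption):

```python
from collections import Counter
from gguf import GGUFReader  # gguf-py package, maintained in the llama.cpp repo

# List which quantization types a GGUF checkpoint actually uses. The
# GGMLQuantizationType enum covers Q4/Q8/K-quants/i-quants etc., but has no
# NVFP4 entry, which is why llama.cpp support would have to land first.
reader = GGUFReader("model-Q4_K_M.gguf")   # path is a placeholder
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{qtype}: {n} tensors")
```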

Reference: github-starred/ollama#6143