[GH-ISSUE #1402] Optimum-NVIDIA - Unlock blazingly fast LLM inference in just 1 line of code #47256

Closed
opened 2026-04-28 03:28:36 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @iplayfast on GitHub (Dec 6, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1402

This article looks like it would be worthwhile: implement a check to see if NVIDIA hardware is present and add that line of code. It looks like it's quantizing to an 8-bit float.

https://huggingface.co/blog/optimum-nvidia


@easp commented on GitHub (Dec 7, 2023):

As always, it's important to understand the conditions under which the claimed speedup was achieved. In this case:

  • High-end GPU
  • fp8 model weights
  • Batch size of 4
  • Using the transformers Python package

Ollama doesn't use the transformers python package, so this isn't going to be a single line of code to implement.
It uses a batch size of 1, already uses quantized model weights, and applies other optimizations (through llama.cpp). The quantization alone probably accounts for a significant part of the 2.6x speedup they reported on the 4090 GPU.
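To make the fp8 point concrete: quantizing weights to an 8-bit float means rounding each value to the nearest representable fp8 number. The sketch below illustrates that rounding for the E4M3 format (4 exponent bits, 3 mantissa bits, bias 7, max normal 448) commonly used for fp8 weights; it is only a toy illustration of the idea, not Optimum-NVIDIA's or TensorRT-LLM's actual implementation, and it ignores subnormals and special values.

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest value representable in fp8 E4M3
    (4 exponent bits, 3 mantissa bits, bias 7, max normal 448)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    # Clamp to the E4M3 dynamic range (saturate instead of overflowing).
    mag = min(mag, 448.0)
    # Exponent of the enclosing power-of-two interval,
    # floored at the smallest normal exponent for bias 7.
    e = max(math.floor(math.log2(mag)), -6)
    # With 3 mantissa bits, representable values in [2^e, 2^(e+1))
    # are spaced 2^(e-3) apart.
    step = 2.0 ** (e - 3)
    return sign * round(mag / step) * step
```

For example, `quantize_e4m3(0.1)` returns 0.1015625 and `quantize_e4m3(1000.0)` saturates to 448.0, which is why 8-bit weights trade precision and range for halved memory traffic versus fp16.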


@iplayfast commented on GitHub (Dec 7, 2023):

I suspected as much, but didn't want to let an opportunity to enhance ollama pass by.


Reference: github-starred/ollama#47256