[GH-ISSUE #2046] What quantization is used to quantize Phi-2? #26942

Closed
opened 2026-04-22 03:44:11 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @bm777 on GitHub (Jan 18, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2046

Running Phi-2 with Ollama is faster than running Phi-2 in Rust with Candle. The Rust process uses 1.7 GB of memory, while Ollama uses only 788 MB. I assume both are loading the same quantized GGUF file of about 1.6 GB.

Is Ollama

  • quantizing the model at run time,
  • quantizing it ahead of time,
  • using llama.cpp under the hood, or
  • applying no quantization at all?
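
A quick local check is to ask the running Ollama server what the phi tag resolves to. The sketch below is a minimal example, assuming the default address localhost:11434 and an Ollama build whose /api/show response includes a details object with a quantization_level field (present in recent releases; older builds may only return the modelfile text).

```python
# Minimal sketch: query a local Ollama server for phi:latest's metadata.
# Assumes the default address (localhost:11434); the exact shape of the
# response, in particular the "details" object, is an assumption that
# depends on the Ollama version.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/show",
    data=json.dumps({"name": "phi"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    info = json.load(resp)

# Expected (assumed) shape: {"format": "gguf", ..., "quantization_level": "Q4_0"}
print(info.get("details", {}))
```

If the details object is present, quantization_level answers the question directly; the modelfile text in the same response should also show that the tag points at a pre-built GGUF blob rather than anything quantized at run time.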

Author
Owner

@easp commented on GitHub (Jan 18, 2024):

Look at https://ollama.ai/library/phi/tags. You can check the fingerprint to figure out which quantization is used for phi:latest. Or don't bother, because at this point you can pretty much count on it being q4_0 for any model in the ollama.ai library.

That said, your memory utilization figure for Ollama is probably off. Ollama uses llama.cpp under the hood and mmaps the model weights, so they don't show up as part of the process's memory. Instead they're accounted for under the file cache on Linux, and on macOS under either wired memory (while an inference is in progress) or the file cache (when idle).
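
To confirm the quantization independently of the library page, you can also read it straight out of the GGUF metadata of the blob Ollama pulled (by default the weights live under ~/.ollama/models/blobs/). The following is a minimal sketch of a GGUF v2/v3 header reader; the file-type table is a partial copy of llama.cpp's llama_ftype enum, and the blob path is passed on the command line since the exact file name is a content digest.

```python
#!/usr/bin/env python3
"""Print the quantization recorded in a GGUF file's general.file_type field.

Usage: python gguf_filetype.py /path/to/blob
(Ollama keeps pulled weights as digest-named blobs under ~/.ollama/models/blobs/.)
"""
import struct
import sys

# Partial mapping of general.file_type values to llama.cpp ftype names.
FILE_TYPES = {0: "F32", 1: "F16", 2: "Q4_0", 3: "Q4_1", 7: "Q8_0",
              8: "Q5_0", 9: "Q5_1", 10: "Q2_K", 15: "Q4_K_M", 18: "Q6_K"}

def read_string(f):
    # GGUF string: uint64 length followed by UTF-8 bytes.
    (n,) = struct.unpack("<Q", f.read(8))
    return f.read(n).decode("utf-8", errors="replace")

def read_value(f, vtype):
    # Scalar metadata value types and their struct formats.
    scalars = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
               6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
    if vtype in scalars:
        fmt = scalars[vtype]
        (v,) = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
        return v
    if vtype == 8:   # string
        return read_string(f)
    if vtype == 9:   # array: element type, count, then elements
        etype, = struct.unpack("<I", f.read(4))
        count, = struct.unpack("<Q", f.read(8))
        return [read_value(f, etype) for _ in range(count)]
    raise ValueError(f"unknown GGUF metadata value type {vtype}")

with open(sys.argv[1], "rb") as f:
    magic, _version = struct.unpack("<4sI", f.read(8))
    assert magic == b"GGUF", "not a GGUF file"
    _tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    for _ in range(kv_count):
        key = read_string(f)
        vtype, = struct.unpack("<I", f.read(4))
        value = read_value(f, vtype)
        if key == "general.file_type":
            print(f"general.file_type = {value} ({FILE_TYPES.get(value, 'other')})")
            break
    else:
        print("no general.file_type key in this file's metadata")
```

For a q4_0 Phi-2 you would expect a value of 2, which lines up with the ~1.6 GB file size mentioned above; the memory the process itself reports will still be smaller, for the mmap reasons described in this comment.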

Author
Owner

@bm777 commented on GitHub (Jan 19, 2024):

@easp thanks :)

Reference: github-starred/ollama#26942