[GH-ISSUE #1583] Towards better Ollama #872

Closed
opened 2026-04-12 10:32:02 -05:00 by GiteaMirror · 5 comments

Originally created by @eramax on GitHub (Dec 18, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1583

Because Ollama has better interactive features than llama.cpp, I prefer it.

I sincerely hope that you will expand Ollama to support other quantization formats. As you are aware, this will require writing some of the quantization algorithms yourselves, because the existing implementations are all written in Python and pull in package dependencies, which compiled code lets you avoid.
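
For a sense of what that would involve, here is a minimal sketch of the kind of blockwise 4-bit quantization llama.cpp's Q4_0 format performs. This is a simplification: the real Q4_0 packs two 4-bit values per byte and stores a float16 scale per block of 32 weights, while this sketch keeps everything unpacked for clarity.

```python
import numpy as np

def quantize_q4_blockwise(weights: np.ndarray, block_size: int = 32):
    """Simplified symmetric 4-bit blockwise quantization (Q4_0-like)."""
    x = weights.reshape(-1, block_size).astype(np.float32)
    scales = np.abs(x).max(axis=1, keepdims=True) / 7.0  # one scale per block
    scales[scales == 0] = 1.0                            # guard all-zero blocks
    quants = np.clip(np.round(x / scales), -8, 7).astype(np.int8)
    return quants, scales

def dequantize(quants: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (quants.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q4_blockwise(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```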

I recommend checking out exllamav2, which is becoming more and more popular these days and is much faster than GGUF while consuming the same amount of VRAM or less.
Another proposal is to ship Ollama with a user interface in addition to the console one: a very basic web application built on the Ollama API. It would be highly beneficial to users, and if you don't want the extra work of managing data, user data could be stored locally in the browser.
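
As a rough illustration of how thin such a UI could be, here is a sketch of the single API call it would sit on top of, in Python rather than browser JavaScript (the model name and host are assumptions; the endpoint and line-delimited streaming format are from the Ollama API docs):

```python
import json
import urllib.request

def generate(prompt: str, model: str = "llama2",
             host: str = "http://localhost:11434") -> None:
    """Stream a completion from a local Ollama server and print it."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # Ollama streams one JSON object per line
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break

generate("Why is the sky blue?")
```

A browser UI would do the same thing with `fetch()` and keep the chat history in `localStorage`, which is what I meant by storing user data locally.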

I'm not really familiar with how llama.cpp works or how it decides which layers to offload, but I believe certain parts of the model are less crucial than others, so it may be possible to offload only certain portions (this is just a hunch, not backed by any real knowledge).
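
For what it's worth, llama.cpp already supports partial offload of a configurable number of layers (its `n_gpu_layers` parameter), and Ollama exposes this as the `num_gpu` request option. A sketch, with an illustrative layer count and model name:

```python
import json
import urllib.request

# Offload only 20 layers to the GPU and keep the rest in system RAM.
# "num_gpu" maps to llama.cpp's n_gpu_layers parameter.
payload = {
    "model": "llama2",
    "prompt": "Hello",
    "stream": False,
    "options": {"num_gpu": 20},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```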

Recently, I have used Ollama a lot through my Colab account, and it works really well and quickly. However, I would prefer to be able to run Ollama without requiring a service; if, during installation, I could set it up to run as an app without a service, that would be much more efficient for my Jupyter notebook, since I can't get the same experience on Colab.
https://gist.github.com/eramax/8533181ad841e4612041c42d154df003
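
Until that exists, a workaround I'd expect to work in a notebook is to launch the server as an ordinary child process rather than a system service. A minimal sketch, assuming the `ollama` binary is on the PATH:

```python
import subprocess
import time
import urllib.request

# Run the Ollama server as a plain child process of the notebook kernel
# instead of a system service.
server = subprocess.Popen(
    ["ollama", "serve"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

# Poll until the HTTP endpoint answers before sending any requests.
for _ in range(30):
    try:
        urllib.request.urlopen("http://localhost:11434")
        break
    except OSError:
        time.sleep(1)

# ... issue /api/generate requests as usual ...

server.terminate()  # stop the server when the notebook is done
```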


@justinh-rahb commented on GitHub (Dec 18, 2023):

Ollama is built upon llama.cpp, meaning it naturally supports the quantization algorithms present in llama.cpp. The complexity and resource demands of developing new quantization algorithms are significant, hence Ollama’s reliance on llama.cpp’s existing capabilities in this area.
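
Concretely, the model library publishes several llama.cpp quantization levels of the same model as separate tags, so you can already pick one today. A sketch using the `/api/pull` endpoint (the exact tag is illustrative; check the model's page for the quantizations actually published):

```python
import json
import urllib.request

# Pull a 4-bit (q4_0) build of a model by tag; tag names vary per model.
req = urllib.request.Request(
    "http://localhost:11434/api/pull",
    data=json.dumps({"name": "llama2:7b-q4_0"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:  # progress is streamed as JSON status lines
        print(json.loads(line).get("status", ""))
```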

Regarding GPU offloading, Ollama shares the same methods as llama.cpp. Any enhancements in llama.cpp’s GPU offloading are directly applicable to Ollama.

Although projects like exllamav2 offer interesting features, Ollama's focus is closely tied to llama.cpp, and as far as I know there are no current plans to bring in other model loaders.

As for the user interface, Ollama’s scope is currently focused on backend functionalities, and a dedicated UI is not within its intended scope. For those seeking a graphical user interface, there are other projects listed in the README.

I hope this provides clarity on Ollama’s capabilities and development focus, and addresses your queries regarding its relationship with llama.cpp.


@eramax commented on GitHub (Dec 18, 2023):

Thank you for sharing Ollama's focus. It's a great product strategy to concentrate on specific goals and outperform the competition there. I understand that Ollama's team and product cannot meet every user need, so users will have to turn to other products to get the most out of LLMs.


@ehartford commented on GitHub (Jan 6, 2024):

exllamav2 and quip-sharp would be lovely


@shaqq commented on GitHub (Feb 10, 2024):

> I recommend checking out exllamav2, which is becoming more and more popular these days and is much faster than GGUF while consuming the same amount of VRAM or less.

I believe the gap between GGUF and ExLlamav2 has shortened. See the update here:

https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/

and the updates to llama.cpp:

https://github.com/ggerganov/llama.cpp/pull/3776#issuecomment-1781472687

So things are changing quite fast.


@mchiang0610 commented on GitHub (Mar 11, 2024):

Hi @eramax, thank you for sharing this! We do evaluate other inference engines and formats for Ollama. That being said, any engine we add to Ollama creates more surface area to manage (crashes, performance issues, security vulnerabilities, etc.), so we have to be very careful in selecting which engines or formats to use.

The idea here is for us to find a systematic approach to inference, leveraging hardware vendors' drivers directly for speed and reliability.

I will close this issue for now, and I'm very thankful for it. We will continue to evaluate what fits and keep learning.

Reference: github-starred/ollama#872