[GH-ISSUE #990] TPU backend support #482

Open
opened 2026-04-12 10:09:39 -05:00 by GiteaMirror · 25 comments

Originally created by @coolrazor007 on GitHub (Nov 3, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/990

Originally assigned to: @dhiltgen on GitHub.

Would love to see Ollama run on a TPU not just GPU. Has this been done by anyone already?

GiteaMirror added the feature request label 2026-04-12 10:09:39 -05:00

@krenax commented on GitHub (Nov 6, 2023):

This may be interesting. Is Ollama officially supported on the TPU?


@coolrazor007 commented on GitHub (Nov 6, 2023):

I did some digging and realized that Ollama is based on llama.cpp which does not support TPUs currently.


@boredcoder411 commented on GitHub (Jan 8, 2024):

Any updates on this? I really want to use my edge tpu and raspberry pi with this project


@easp commented on GitHub (Jan 9, 2024):

@boredcoder411 Edge TPUs are not suited for LLMs. They only have, what, 2GB of RAM and slow flash memory.


@boredcoder411 commented on GitHub (Jan 9, 2024):

So what CAN I run on them?


@helium729 commented on GitHub (Apr 28, 2024):

Maybe our team would like to help with this feature; it's just that we don't know where to get started yet.


@GameTec-live commented on GitHub (May 15, 2024):

I'd love support for the PCIe Coral TPUs... You should be able to swap data in and out of memory over the PCIe bus fast enough (at least that's what I've read somewhere), and with the Pi 5 now having NVMe support, I'd love to be able to just build a tiny little LLM server...
https://coral.ai/products/m2-accelerator-dual-edgetpu/


@boredcoder411 commented on GitHub (May 15, 2024):

Fairly sure JAX/Flax supports the Coral TPUs, and also the USB accelerator. Also sure it has enough RAM to hold at least TinyLlama.


@quadcom commented on GitHub (May 24, 2024):

Why couldn't a RAMDISK be created to hold model files in the case of a TPU? I have an Unraid server with 96GB of RAM. Reserving 12-24GB for a RAMDISK wouldn't be a huge hit to its performance.


@easp commented on GitHub (May 24, 2024):

A RAM disk, besides being an obsolete throwback, isn't going to help with the fact that Coral TPUs don't have enough RAM and only have a slow USB connection to the host system.

In addition, they aren't all that fast. They aren't supported by Ollama & they aren't likely to be, because anyone capable of doing the work likely has better things to do, and even if they did the work, it's unlikely that the Ollama maintainers would merge it because it would add complexity for very little benefit.


@spc789 commented on GitHub (Jun 10, 2024):

I agree with the ramdisk feature; with an NVMe/PCIe TPU like the Coral PCIe TPUs (NOT the USB version) or the Hailo TPUs, they are tied to the PCIe bus.

Making a ramdisk, which is a way to forcibly keep things in memory, can be an option to speed things up.

Personally I use the ramdisk strategy with large pandas dataframes, and /dev/shm on Linux if I need interprocess communication done on such things.

RAM is directly tied to the memory bus, so using this strategy could have a huge benefit with TPUs which rely on streaming the model data from memory (those that don't have much RAM onboard).
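
As an aside on the /dev/shm idea: from Go's point of view a tmpfs path is just a file, so a loader could memory-map a model that has been copied there and every page fault would be served from RAM. A minimal, Linux-only sketch (the path is hypothetical and this is not Ollama's actual loading code):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Hypothetical path: a model copied onto tmpfs beforehand,
	// e.g. `cp model.gguf /dev/shm/`.
	const path = "/dev/shm/model.gguf"

	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Map the file read-only. Because /dev/shm is tmpfs, the pages are
	// already resident in RAM and faults never hit a disk.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	fmt.Printf("mapped %d bytes of model weights from tmpfs\n", len(data))
}
```

Note that this only keeps the weights pinned in system RAM; it does nothing about the accelerator's own memory or the bandwidth of the link to it.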


@boredcoder411 commented on GitHub (Aug 11, 2024):

Ramdisks sound like a great idea, but how would this work in Go?


@jasonsmithio commented on GitHub (Aug 27, 2024):

I am happy to help if anyone is tackling this already!


@mfp20 commented on GitHub (Nov 15, 2024):

Does it work? I'm evaluating the idea of buying one of those Hailo M.2 cards, as I don't need more GPUs...


@boredcoder411 commented on GitHub (Nov 15, 2024):

Probably not, because Google doesn't like consumers and has closed issues on the JAX repo requesting TPU support.


@easp commented on GitHub (Nov 15, 2024):

@mfp20 That accelerator isn't designed for LLMs. Look at the models they use in their benchmarks. EfficientNetV2-M is a 54M-parameter model. That's 50x smaller than even a small LLM. They don't have the onboard memory, and shifting the weights over PCIe 2x for each token isn't any more realistic than it is for a GPU.

![Overloaded Jetta](https://github.com/user-attachments/assets/49849c1a-b502-4cf7-b0de-b7d71bca7b02)
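
To put rough numbers behind that comparison (the parameter counts and bytes-per-weight below are illustrative approximations, not measurements):

```go
package main

import "fmt"

func main() {
	// Weight-only footprint, ignoring activations and KV cache.
	models := []struct {
		name   string
		params float64
	}{
		{"EfficientNetV2-M (vision)", 54e6},
		{"TinyLlama 1.1B", 1.1e9},
		{"Llama 2 7B", 7e9},
	}
	for _, m := range models {
		int8GB := m.params * 1.0 / 1e9 // ~1 byte per parameter
		q4GB := m.params * 0.5 / 1e9   // ~4-bit quantization
		fmt.Printf("%-26s int8 ≈ %5.2f GB   4-bit ≈ %5.2f GB\n", m.name, int8GB, q4GB)
	}
}
```

Even the smallest popular LLMs land in the gigabyte range once quantized, while an Edge TPU has on the order of megabytes of on-chip memory, so the weights would have to stream over the link for every token.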


@mfp20 commented on GitHub (Nov 16, 2024):

> Probably not, because Google doesn't like consumers and has closed issues on the JAX repo requesting TPU support.

Any link about this? If the PCIe latencies (to access the RAM) are too high, there's no point in keeping the issue open... as it is not possible to use the TPUs. Other accelerators failed in the past (e.g. crypto accelerators on PCI slots).

> @mfp20 That accelerator isn't designed for LLMs. Look at the models they use in their benchmarks. EfficientNetV2-M is a 54M-parameter model. That's 50x smaller than even a small LLM. They don't have the onboard memory, and shifting the weights over PCIe 2x for each token isn't any more realistic than it is for a GPU.

I've never looked into the computational details; I suppose you are right. But PCIe 5 latencies might be low enough to provide effective acceleration. Maybe not much, but even 1.5x would enable some use cases.

TPUs aside, I wonder if some sort of RAM caching mechanism might do the trick. Something like loading the whole model into RAM and then having a background process shift parts of it into VRAM before the GPU needs that part. I've no idea of the level of parallelism required by those algorithms, or what the chances are of predicting which chunks of the model will be needed, but ... any parallelism can be reduced to multiple steps. It's slower, but it would enable decent timings on RAM-rich systems and consumer GPUs with little VRAM. At the end of the day we would probably see something similar to the performance of https://github.com/b4rtaz/distributed-llama .


@easp commented on GitHub (Nov 16, 2024):

@mfp20 For monolithic models (i.e. not mixture-of-experts) it's very easy to predict which chunks of the model will be needed, because the entire model is read sequentially from start to finish for each token.

For token generation, compute isn't really an issue and PCIe latency isn't the issue; bandwidth is. For this reason, if a portion of the model is in RAM it's generally faster to compute on the CPU than it is to ship the data over PCIe in order to use it on a GPU or TPU.
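
A back-of-the-envelope bound makes this concrete: during decoding every weight is read once per token, so tokens/second cannot exceed link bandwidth divided by model size. The figures below are rough, order-of-magnitude assumptions, not benchmarks:

```go
package main

import "fmt"

func main() {
	const modelGB = 4.0 // e.g. a ~7B model quantized to roughly 4 bits

	// Very rough sustained bandwidths, in GB/s (assumed for illustration).
	links := []struct {
		name string
		gbs  float64
	}{
		{"USB 3.0 (Coral USB accelerator)", 0.5},
		{"PCIe 3.0 x1 (Coral M.2)", 1.0},
		{"Dual-channel DDR5 (CPU)", 64},
		{"GDDR6 VRAM (mid-range GPU)", 400},
	}
	for _, l := range links {
		// Ceiling: every weight crosses this link once per generated token.
		fmt.Printf("%-34s ≤ %6.1f tokens/s\n", l.name, l.gbs/modelGB)
	}
}
```

Weights streamed over a slow accelerator link lose to simply reading them out of system RAM on the CPU, which is the point being made above.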


@mfp20 commented on GitHub (Nov 17, 2024):

> @mfp20 For monolithic models (i.e. not mixture-of-experts) it's very easy to predict which chunks of the model will be needed, because the entire model is read sequentially from start to finish for each token.

OK, this can be adjusted in software, in a similar way to how it is adjusted today when running some layers on CPUs and some on GPUs.
I imagine a couple of DMA pumps moving a few model chunks from RAM to VRAM, synced by the main evaluation process. If a model is 56 GB and the VRAM is 8 GB, you only have 7 chunks to move for each token. The end result would be execution slower than GPU-with-VRAM (because of insufficient memory bandwidth) and faster than CPU-only, but ... hey ... better than no execution at all (because of the lack of enough VRAM), or the totally unusably slow execution of a bare CPU with the system completely clogged (because the general-purpose cores are executing the model). It is an enabling solution for consumer-grade systems.

> For token generation, compute isn't really an issue and PCIe latency isn't the issue; bandwidth is. For this reason, if a portion of the model is in RAM it's generally faster to compute on the CPU than it is to ship the data over PCIe in order to use it on a GPU or TPU.

If bandwidth is the limit, then a PCIe 5 TPU can accelerate: both DDR5 and PCIe 5 offer about 64 GB/s of bandwidth. That's an order of magnitude less than consumer GPUs' VRAM (200-800 GB/s), but gains in CPU-based performance are already seen when llama.cpp workers can rely on complex instructions (e.g. AVX, AVX2, AVX-512). Having an ASIC like the ones in TPUs, instead of somewhat general-purpose instructions, might further increase the acceleration. And offload the main CPU.

**Problem is: none of those TPU cards are PCIe 5 x16.** And an x16 card might end up being as expensive as a second-hand GPU on eBay...

In any case, improving heterogeneous computing by implementing the RAM-to-VRAM buffering described above might be useful. Probably not much for the single-prompt use case, but for parallel operations. I didn't look at the current code (in llama.cpp, Ollama, LM Studio, and so on), but it looks like they are struggling to mix multiple kinds of silicon.
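
For illustration, the "DMA pump" idea is essentially double buffering: start copying chunk i+1 while chunk i is being computed on. A toy Go sketch of only the scheduling (the `upload` and `compute` functions are placeholders; real code would drive PCIe/DMA transfers and an accelerator kernel):

```go
package main

import (
	"fmt"
	"time"
)

const numChunks = 7 // e.g. a 56 GB model split into 8 GB, VRAM-sized chunks

// upload stands in for a DMA transfer of one chunk from RAM into VRAM.
func upload(chunk int) int {
	time.Sleep(10 * time.Millisecond) // pretend transfer time
	return chunk
}

// compute stands in for running that chunk's layers on the accelerator.
func compute(chunk int) {
	time.Sleep(10 * time.Millisecond) // pretend compute time
	fmt.Printf("computed chunk %d\n", chunk)
}

func main() {
	// Capacity 1 lets the uploader stay exactly one chunk ahead of compute,
	// so the transfer of chunk i+1 overlaps with the compute on chunk i.
	ready := make(chan int, 1)

	go func() {
		for i := 0; i < numChunks; i++ {
			ready <- upload(i)
		}
		close(ready)
	}()

	for chunk := range ready {
		compute(chunk)
	}
	fmt.Println("one token generated; the whole sweep repeats for the next token")
}
```

Even with perfect overlap, though, the token rate is still capped by the RAM-to-VRAM link, per the bandwidth argument above.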


@antarix1 commented on GitHub (Jan 31, 2025):

I am a novice user. I have only used Ollama on my Linux host and loved it, albeit on ancient hardware without a GPU.
The little that I understand from the above discussion is that although the tensor processors are useful for compute, the severely limited RAM and bus speeds handicap their usability for fast chatbots.

I found [this item](https://iot.asus.com/products/AI-accelerator/AI-Accelerator-PCIe-Card/) and wondered if it would be any good?

Also, maybe foolish talk, but do you think it is possible to build a PCIe card with 4 of [these chips](https://coral.ai/products/accelerator-module/) and memory brackets for SODIMM DDR4 RAM, so that users may install as much RAM as they want? Please enlighten me on the intricacies and difficulties of making such a card.


@mfp20 commented on GitHub (Jan 31, 2025):

> Please enlighten me on the intricacies and difficulties of making such a card.

There's no el cheapo solution. I keep looking around, like many others; but as of today, AI isn't for everyone (yet). You can pay tokens to the big players, or experiment with open models using commodity hardware thanks to Ollama and the like. That's it.

I didn't look at Coral's datasheet, but I doubt it has the memory controller to connect large amounts of RAM, and not of the right kind. Moreover, producing PCBs for RAM chips/slots is ... expensive, as it needs to be perfectly tuned for all those signals packed into a small space; the track routing must be perfect. The card you are suggesting wouldn't be the typical weekend project you can buy on Tindie.

Moreover, have a look at Nvidia's 5000 cards: they enabled FP4 and claimed a 115% AI performance improvement. But they just ... monetized the quantization that we used to tune ourselves to help our cheap GPUs. The 100% improvement is just ... halving the information quantum in order to double the performance per cycle, given the same number of transistors as the previous gen; the other 15% is because the 5000 cards have 5% more cores and better RAM. In other words: Nvidia pwned a software improvement that used to be in users' hands. Neither the governments (with their baroque laws) nor the industry (with its marketing) is really helping to democratize AI...

Your best bets currently are Apple's machines (because they share RAM and VRAM), starting at 6000+ coins, Rockchip clusters (about 1500 coins to have one), or ... eBay & pray (that the US anti-trust authority cracks Nvidia)...


@antarix1 commented on GitHub (Jan 31, 2025):

Excellent insight. Thanks for taking the time and replying in detail.

> There's no el cheapo solution. I keep looking around, like many others; but as of today, AI isn't for everyone (yet). You can pay tokens to the big players, or experiment with open models using commodity hardware thanks to Ollama and the like. That's it.

El cheapo is only a secondary objective. The primary objective is to run CUDA-dependent models on non-Nvidia hardware but still get some benefit from tensor cores, or to run tensor-lite models. AMD is trying to run CUDA code using translation, and it is effectively useless.
Aaaaaaand, I never trust a monopoly, be it in code, product or scientific thinking. It becomes a matter of WHEN, not IF (don't be evil).

> I didn't look at Coral's datasheet, but I doubt it has the memory controller to connect large amounts of RAM, and not of the right kind. Moreover, producing PCBs for RAM chips/slots is ... expensive, as it needs to be perfectly tuned for all those signals packed into a small space; the track routing must be perfect. The card you are suggesting wouldn't be the typical weekend project you can buy on Tindie.

This is indeed true. I have heard from experts how difficult it is to design signal paths for high-speed, low-latency memory. So the price of the final product would be at least similar to a mid-tier used GPU, would be my guess. Then again, the focus is to develop an open-source PCB design that enthusiasts can manufacture themselves using off-the-shelf components.

> Moreover, have a look at Nvidia's 5000 cards: they enabled FP4 and claimed a 115% AI performance improvement. But they just ... monetized the quantization that we used to tune ourselves to help our cheap GPUs. The 100% improvement is just ... halving the information quantum in order to double the performance per cycle, given the same number of transistors as the previous gen; the other 15% is because the 5000 cards have 5% more cores and better RAM. In other words: Nvidia pwned a software improvement that used to be in users' hands. Neither the governments (with their baroque laws) nor the industry (with its marketing) is really helping to democratize AI...

Thank you for bringing this up. They dare to do this because they stand without competition or even a remote alternative. Besides, I have always treated the marketing fluff of percentages as a gimmick. As soon as they begin talking in % points, I stop listening. My concern is for when they lock down these cards so that users can no longer optimize and tinker on their own terms.

> Your best bets currently are Apple's machines (because they share RAM and VRAM), starting at 6000+ coins, Rockchip clusters (about 1500 coins to have one), or ... eBay & pray (that the US anti-trust authority cracks Nvidia)...

No Apple, thank you. Please refer to [Louis Rossmann](https://rossmanngroup.com/). Not even a used MacBook.
eBay is okay, but not reliable.

Please have a look at [this](https://www.makerfabs.com/dual-edge-tpu-adapter-m2-2280-b-m-key.html); it seems people are already working on it. Also, [this](https://www.makerfabs.com/dual-edge-tpu-adapter.html) looks promising as a take-off point for the base design.

Thanks again for your deep thought and consideration. I'd invite others to offer valuable insights into this conversation.


@antarix1 commented on GitHub (Jan 31, 2025):

Please check out https://github.com/magic-blue-smoke/Dual-Edge-TPU-Adapter


@mfp20 commented on GitHub (Feb 2, 2025):

> Excellent insight. Thanks for taking the time and replying in detail.
>
> > There's no el cheapo solution. I keep looking around, like many others; but as of today, AI isn't for everyone (yet). You can pay tokens to the big players, or experiment with open models using commodity hardware thanks to Ollama and the like. That's it.
>
> El cheapo is only a secondary objective. The primary objective is to run CUDA-dependent models on non-Nvidia hardware but still get some benefit from tensor cores, or to run tensor-lite models.

Dude, you didn't read the previous contributions in this thread, so you are missing a point: the neural units and the RAM MUST be tightly coupled (i.e. there must be HIGH memory bandwidth, because the neurons need to iterate multiple times over the whole model sitting in RAM; the higher the bandwidth the better; the universe above the sky is the only limit). That said, if you compare the NPU-memory bandwidth on an Nvidia card with the bandwidth of the PCIe 5 bus, you'll see the huge difference. In other words, there are no buses readily available on our computers that can match the bandwidth available on the GPU card alone. Modern GPU cards are autonomous systems that communicate over the PCIe bus from time to time in order to access the low-speed components (disk, network, user I/O) they need to deliver the job...

If you place the NPU on an x16 PCIe 5 slot, you introduce a bottleneck between the NPU and the RAM. It doesn't matter how many Corals you pack on a single PCIe slot... the more you pack, the more the bottleneck will impact the NPU's performance. I pointed you to Apple's because there is software able to exploit Thunderbolt/USB4 connections in order to focus multiple MacBooks (note: each having up to 96GB of VRAM) on the same AI job (e.g. powered by a 200GB model), but again ... 40 Gbps isn't the 1800 GB/s available to the Nvidia GPU... so the end result will be WAY slower than a single Blackwell-based system. There is already software to work around these issues, but the end result CAN'T match the performance of a proper hardware solution.
In software you can buffer, exploit some common hardware components (e.g. MMUs, DMA units) and so on in order to parallelize the work over multiple cheap GPUs, each having some tens of Gbps of bandwidth available on PCIe or Thunderbolt buses, but it looks like the AI isn't very parallelizable, so you can't have much success. In hardware, instead, there are other limits: you can't make a 4D object in our three-dimensional space, so you can't produce a tesseract (i.e. a geometry having equal distance between all the computation cores and all the memory units); have a look at the NUMA architectures available on the market (e.g. Intel Xeon and AMD Threadripper). And that's the reason why those NVIDIA racks are bloody expensive: they are a full pack of workarounds in order to have computation cores and memory at some-sort-of-equal distances.
Even if you manage to have the $$$ (millions) to buy one of those Nvidia racks, then you need the money to pay the electric bill, and the data to train the models. In other words: unless you are Mark Zuckerberg or whoever else (Microsoft, Google, Oracle, some governments) has both the money and mankind's data, you can't fully take advantage of AI tech.
There are exotic solutions as well: quantum computing, biological computing, and so on. But ... well ... do they work? Do they exist? How much do they cost?

You can use Corals (and other cheap AI solutions; there are better units around) for AI-based pattern-matching jobs (e.g. computer vision). What you cannot do is run those huge generative models we currently run with Ollama. That's why Ollama doesn't support TPUs. I might have been blunt but ... that's not me... it's just the sad part of the AI story.

What we can realistically expect from the Ollama project is that they introduce some form of clustering capability already seen in similar software. That's all these developers can do for us, if they are willing to.


@antarix1 commented on GitHub (Feb 4, 2025):

> > Excellent insight. Thanks for taking the time and replying in detail.
> >
> > > There's no el cheapo solution. I keep looking around, like many others; but as of today, AI isn't for everyone (yet). You can pay tokens to the big players, or experiment with open models using commodity hardware thanks to Ollama and the like. That's it.
> >
> > El cheapo is only a secondary objective. The primary objective is to run CUDA-dependent models on non-Nvidia hardware but still get some benefit from tensor cores, or to run tensor-lite models.
>
> Dude, you didn't read the previous contributions in this thread, so you are missing a point: the neural units and the RAM MUST be tightly coupled (i.e. there must be HIGH memory bandwidth, because the neurons need to iterate multiple times over the whole model sitting in RAM; the higher the bandwidth the better; the universe above the sky is the only limit). That said, if you compare the NPU-memory bandwidth on an Nvidia card with the bandwidth of the PCIe 5 bus, you'll see the huge difference. In other words, there are no buses readily available on our computers that can match the bandwidth available on the GPU card alone. Modern GPU cards are autonomous systems that communicate over the PCIe bus from time to time in order to access the low-speed components (disk, network, user I/O) they need to deliver the job...
>
> If you place the NPU on an x16 PCIe 5 slot, you introduce a bottleneck between the NPU and the RAM. It doesn't matter how many Corals you pack on a single PCIe slot... the more you pack, the more the bottleneck will impact the NPU's performance. I pointed you to Apple's because there is software able to exploit Thunderbolt/USB4 connections in order to focus multiple MacBooks (note: each having up to 96GB of VRAM) on the same AI job (e.g. powered by a 200GB model), but again ... 40 Gbps isn't the 1800 GB/s available to the Nvidia GPU... so the end result will be WAY slower than a single Blackwell-based system. There is already software to work around these issues, but the end result CAN'T match the performance of a proper hardware solution.
> In software you can buffer, exploit some common hardware components (e.g. MMUs, DMA units) and so on in order to parallelize the work over multiple cheap GPUs, each having some tens of Gbps of bandwidth available on PCIe or Thunderbolt buses, but it looks like the AI isn't very parallelizable, so you can't have much success. In hardware, instead, there are other limits: you can't make a 4D object in our three-dimensional space, so you can't produce a tesseract (i.e. a geometry having equal distance between all the computation cores and all the memory units); have a look at the NUMA architectures available on the market (e.g. Intel Xeon and AMD Threadripper). And that's the reason why those NVIDIA racks are bloody expensive: they are a full pack of workarounds in order to have computation cores and memory at some-sort-of-equal distances.
> Even if you manage to have the $$$ (millions) to buy one of those Nvidia racks, then you need the money to pay the electric bill, and the data to train the models. In other words: unless you are Mark Zuckerberg or whoever else (Microsoft, Google, Oracle, some governments) has both the money and mankind's data, you can't fully take advantage of AI tech.
> There are exotic solutions as well: quantum computing, biological computing, and so on. But ... well ... do they work? Do they exist? How much do they cost?
>
> You can use Corals (and other cheap AI solutions; there are better units around) for AI-based pattern-matching jobs (e.g. computer vision). What you cannot do is run those huge generative models we currently run with Ollama. That's why Ollama doesn't support TPUs. I might have been blunt but ... that's not me... it's just the sad part of the AI story.
>
> What we can realistically expect from the Ollama project is that they introduce some form of clustering capability already seen in similar software. That's all these developers can do for us, if they are willing to.

Point taken. Thanks again for the detailed reply.


Reference: github-starred/ollama#482