[GH-ISSUE #990] TPU backend support #482

Open
opened 2026-04-12 10:09:39 -05:00 by GiteaMirror · 25 comments

Originally created by @coolrazor007 on GitHub (Nov 3, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/990

Originally assigned to: @dhiltgen on GitHub.

Would love to see Ollama run on a TPU not just GPU. Has this been done by anyone already?

GiteaMirror added the feature request label 2026-04-12 10:09:39 -05:00

@krenax commented on GitHub (Nov 6, 2023):

This may be interesting. Is Ollama officially supported on the TPU?


@coolrazor007 commented on GitHub (Nov 6, 2023):

I did some digging and realized that Ollama is based on llama.cpp which does not support TPUs currently.


@boredcoder411 commented on GitHub (Jan 8, 2024):

Any updates on this? I really want to use my edge tpu and raspberry pi with this project


@easp commented on GitHub (Jan 9, 2024):

@boredcoder411 Edge TPUs are not suited for LLMs. They only have, what, 2GB of RAM and slow flash memory.


@boredcoder411 commented on GitHub (Jan 9, 2024):

So what CAN I run on them?


@helium729 commented on GitHub (Apr 28, 2024):

Maybe our team would like to help with this feature; it's just that we don't know where to get started yet.


@GameTec-live commented on GitHub (May 15, 2024):

I'd love support for the PCIe Coral TPUs... You should be able to swap data in and out of memory over the PCIe bus fast enough (at least that's what I've read somewhere), and with the Pi 5 now having NVMe support, I'd love to be able to just build a tiny little LLM server...
https://coral.ai/products/m2-accelerator-dual-edgetpu/


@boredcoder411 commented on GitHub (May 15, 2024):

Fairly sure JAX/Flax supports the Coral TPUs, and also the USB accelerator. Also sure it has enough RAM to hold at least TinyLlama.


@quadcom commented on GitHub (May 24, 2024):

Why couldn't a RAMDISK be created to hold model files in the case of a TPU? I have an Unraid server with 96GB of RAM. Reserving 12-24GB for a RAMDISK wouldn't be a huge hit to its performance.


@easp commented on GitHub (May 24, 2024):

A RAM disk, besides being an obsolete throwback, isn't going to help with the fact that Coral TPUs don't have enough RAM and only have a slow USB connection to the host system.

In addition, they aren't all that fast. They aren't supported by Ollama & they aren't likely to be, because anyone capable of doing the work likely has better things to do, and even if they did the work, it's unlikely that the Ollama maintainers would merge it because it would add complexity for very little benefit.


@spc789 commented on GitHub (Jun 10, 2024):

I agree with the ramdisk feature; with an NVMe/PCIe TPU like the Coral PCIe TPUs (NOT the USB version) or the Hailo TPUs, they are tied to the PCIe bus.

Making a ramdisk, which is a way to forcibly keep things in memory, can be an option to speed things up.

Personally I use the ramdisk strategy with large pandas dataframes, and /dev/shm on Linux if I need interprocess communication done on such things.

RAM is directly tied to the memory bus, so using this strategy could have a huge benefit with TPUs which rely on streaming the model data from memory (those that don't have much RAM onboard).
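
As an aside on the /dev/shm idea: from Go's point of view a tmpfs path is just a file, so a loader could memory-map a model that has been copied there and every page fault would be served from RAM. A minimal, Linux-only sketch (the path is hypothetical and this is not Ollama's actual loading code):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Hypothetical path: a model copied onto tmpfs beforehand,
	// e.g. `cp model.gguf /dev/shm/`.
	const path = "/dev/shm/model.gguf"

	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Map the file read-only. Because /dev/shm is tmpfs, the pages are
	// already resident in RAM and faults never hit a disk.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	fmt.Printf("mapped %d bytes of model weights from tmpfs\n", len(data))
}
```

Note that this only keeps the weights pinned in system RAM; it does nothing about the accelerator's own memory or the bandwidth of the link to it.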


@boredcoder411 commented on GitHub (Aug 11, 2024):

Ramdisks sound like a great idea, but how would this work in Go?


@jasonsmithio commented on GitHub (Aug 27, 2024):

I am happy to help if anyone is tackling this already!


@mfp20 commented on GitHub (Nov 15, 2024):

Does it work? I'm evaluating the idea of buying one of those Hailo M.2 cards, as I don't need more GPUs...


@boredcoder411 commented on GitHub (Nov 15, 2024):

Probably not, because Google doesn't like consumers and has closed issues on the JAX repo requesting TPU support.


@easp commented on GitHub (Nov 15, 2024):

@mfp20 That accelerator isn't designed for LLMs. Look at the models they use in their benchmarks. EfficientNetV2-M is a 54M-parameter model. That's 50x smaller than even a small LLM. They don't have the onboard memory, and shifting the weights over PCIe 2x for each token isn't any more realistic than it is for a GPU.

![Overloaded Jetta](https://github.com/user-attachments/assets/49849c1a-b502-4cf7-b0de-b7d71bca7b02)
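
To put rough numbers behind that comparison (the parameter counts and bytes-per-weight below are illustrative approximations, not measurements):

```go
package main

import "fmt"

func main() {
	// Weight-only footprint, ignoring activations and KV cache.
	models := []struct {
		name   string
		params float64
	}{
		{"EfficientNetV2-M (vision)", 54e6},
		{"TinyLlama 1.1B", 1.1e9},
		{"Llama 2 7B", 7e9},
	}
	for _, m := range models {
		int8GB := m.params * 1.0 / 1e9 // ~1 byte per parameter
		q4GB := m.params * 0.5 / 1e9   // ~4-bit quantization
		fmt.Printf("%-26s int8 ≈ %5.2f GB   4-bit ≈ %5.2f GB\n", m.name, int8GB, q4GB)
	}
}
```

Even the smallest popular LLMs land in the gigabyte range once quantized, while an Edge TPU has on the order of megabytes of on-chip memory, so the weights would have to stream over the link for every token.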


@mfp20 commented on GitHub (Nov 16, 2024):

> Probably not, because Google doesn't like consumers and has closed issues on the JAX repo requesting TPU support.

Any link about this? If the PCIe latencies (to access the RAM) are too high, there's no point in keeping the issue open... as it is not possible to use the TPUs. Other accelerators failed in the past (e.g. crypto accelerators on PCI slots).

> @mfp20 That accelerator isn't designed for LLMs. Look at the models they use in their benchmarks. EfficientNetV2-M is a 54M-parameter model. That's 50x smaller than even a small LLM. They don't have the onboard memory, and shifting the weights over PCIe 2x for each token isn't any more realistic than it is for a GPU.

I've never looked into the computational details; I suppose you are right. But PCIe 5 latencies might be low enough to provide effective acceleration. Maybe not much, but even 1.5x would enable some use cases.

TPUs aside, I wonder if some sort of RAM caching mechanism might do the trick. Something like loading the whole model into RAM and then having a background process shift parts of it into VRAM before the GPU needs that part. I've no idea of the level of parallelism required by those algorithms, or what the chances are of predicting which chunks of the model will be needed, but ... any parallelism can be reduced to multiple steps. It's slower, but it would enable decent timings on RAM-rich systems and consumer GPUs with little VRAM. At the end of the day we would probably see something similar to the performance of https://github.com/b4rtaz/distributed-llama .


@easp commented on GitHub (Nov 16, 2024):

@mfp20 For monolithic models (i.e. not mixture-of-experts) it's very easy to predict which chunks of the model will be needed, because the entire model is read sequentially from start to finish for each token.

For token generation, compute isn't really an issue and PCIe latency isn't the issue; bandwidth is. For this reason, if a portion of the model is in RAM it's generally faster to compute on the CPU than it is to ship the data over PCIe in order to use it on a GPU or TPU.
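
A back-of-the-envelope bound makes this concrete: during decoding every weight is read once per token, so tokens/second cannot exceed link bandwidth divided by model size. The figures below are rough, order-of-magnitude assumptions, not benchmarks:

```go
package main

import "fmt"

func main() {
	const modelGB = 4.0 // e.g. a ~7B model quantized to roughly 4 bits

	// Very rough sustained bandwidths, in GB/s (assumed for illustration).
	links := []struct {
		name string
		gbs  float64
	}{
		{"USB 3.0 (Coral USB accelerator)", 0.5},
		{"PCIe 3.0 x1 (Coral M.2)", 1.0},
		{"Dual-channel DDR5 (CPU)", 64},
		{"GDDR6 VRAM (mid-range GPU)", 400},
	}
	for _, l := range links {
		// Ceiling: every weight crosses this link once per generated token.
		fmt.Printf("%-34s ≤ %6.1f tokens/s\n", l.name, l.gbs/modelGB)
	}
}
```

Weights streamed over a slow accelerator link lose to simply reading them out of system RAM on the CPU, which is the point being made above.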


@mfp20 commented on GitHub (Nov 17, 2024):

> @mfp20 For monolithic models (i.e. not mixture-of-experts) it's very easy to predict which chunks of the model will be needed, because the entire model is read sequentially from start to finish for each token.

OK, this can be adjusted in software, in a similar way to how it is adjusted today when running some layers on CPUs and some on GPUs.
I imagine a couple of DMA pumps moving a few model chunks from RAM to VRAM, synced by the main evaluation process. If a model is 56 GB and the VRAM is 8 GB, you only have 7 chunks to move for each token. The end result would be execution slower than GPU-with-VRAM (because of insufficient memory bandwidth) and faster than CPU-only, but ... hey ... better than no execution at all (because of the lack of enough VRAM), or the totally unusably slow execution of a bare CPU with the system completely clogged (because the general-purpose cores are executing the model). It is an enabling solution for consumer-grade systems.

> For token generation, compute isn't really an issue and PCIe latency isn't the issue; bandwidth is. For this reason, if a portion of the model is in RAM it's generally faster to compute on the CPU than it is to ship the data over PCIe in order to use it on a GPU or TPU.

If bandwidth is the limit, then a PCIe 5 TPU can accelerate: both DDR5 and PCIe 5 offer about 64 GB/s of bandwidth. That's an order of magnitude less than consumer GPUs' VRAM (200-800 GB/s), but gains in CPU-based performance are already seen when llama.cpp workers can rely on complex instructions (e.g. AVX, AVX2, AVX-512). Having an ASIC like the ones in TPUs, instead of somewhat general-purpose instructions, might further increase the acceleration. And offload the main CPU.

**Problem is: none of those TPU cards are PCIe 5 x16.** And an x16 card might end up being as expensive as a second-hand GPU on eBay...

In any case, improving heterogeneous computing by implementing the RAM-to-VRAM buffering described above might be useful. Probably not much for the single-prompt use case, but for parallel operations. I didn't look at the current code (in llama.cpp, Ollama, LM Studio, and so on), but it looks like they are struggling to mix multiple kinds of silicon.
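
For illustration, the "DMA pump" idea is essentially double buffering: start copying chunk i+1 while chunk i is being computed on. A toy Go sketch of only the scheduling (the `upload` and `compute` functions are placeholders; real code would drive PCIe/DMA transfers and an accelerator kernel):

```go
package main

import (
	"fmt"
	"time"
)

const numChunks = 7 // e.g. a 56 GB model split into 8 GB, VRAM-sized chunks

// upload stands in for a DMA transfer of one chunk from RAM into VRAM.
func upload(chunk int) int {
	time.Sleep(10 * time.Millisecond) // pretend transfer time
	return chunk
}

// compute stands in for running that chunk's layers on the accelerator.
func compute(chunk int) {
	time.Sleep(10 * time.Millisecond) // pretend compute time
	fmt.Printf("computed chunk %d\n", chunk)
}

func main() {
	// Capacity 1 lets the uploader stay exactly one chunk ahead of compute,
	// so the transfer of chunk i+1 overlaps with the compute on chunk i.
	ready := make(chan int, 1)

	go func() {
		for i := 0; i < numChunks; i++ {
			ready <- upload(i)
		}
		close(ready)
	}()

	for chunk := range ready {
		compute(chunk)
	}
	fmt.Println("one token generated; the whole sweep repeats for the next token")
}
```

Even with perfect overlap, though, the token rate is still capped by the RAM-to-VRAM link, per the bandwidth argument above.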


@antarix1 commented on GitHub (Jan 31, 2025):

I am a novice user. I have only used Ollama on my Linux host and loved it, albeit on ancient hardware without a GPU.
The little that I understand from the above discussion is that although the tensor processors are useful for compute, the severely limited RAM and bus speeds handicap their usability for fast chatbots.

I found [this item](https://iot.asus.com/products/AI-accelerator/AI-Accelerator-PCIe-Card/) and wondered if it would be any good?

Also, maybe foolish talk, but do you think it is possible to build a PCIe card with 4 of [these chips](https://coral.ai/products/accelerator-module/) and memory brackets for SODIMM DDR4 RAM, so that users may install as much RAM as they want? Please enlighten me on the intricacies and difficulties of making such a card.


@mfp20 commented on GitHub (Jan 31, 2025):

> Please enlighten me on the intricacies and difficulties of making such a card.

There's no el cheapo solution. I keep looking around, like many others; but as of today, AI isn't for everyone (yet). You can pay tokens to the big players, or experiment with open models using commodity hardware thanks to Ollama and the like. That's it.

I didn't look at Coral's datasheet, but I doubt it has the memory controller to connect large amounts of RAM, and not of the right kind. Moreover, producing PCBs for RAM chips/slots is ... expensive, as it needs to be perfectly tuned for all those signals packed into a small space; the track routing must be perfect. The card you are suggesting wouldn't be the typical weekend project you can buy on Tindie.

Moreover, have a look at Nvidia's 5000 cards: they enabled FP4 and claimed a 115% AI performance improvement. But they just ... monetized the quantization that we used to tune ourselves to help our cheap GPUs. The 100% improvement is just ... halving the information quantum in order to double the performance per cycle, given the same number of transistors as the previous gen; the other 15% is because the 5000 cards have 5% more cores and better RAM. In other words: Nvidia pwned a software improvement that used to be in users' hands. Neither the governments (with their baroque laws) nor the industry (with its marketing) is really helping to democratize AI...

Your best bets currently are Apple's machines (because they share RAM and VRAM), starting at 6000+ coins, Rockchip clusters (about 1500 coins to have one), or ... eBay & pray (that the US anti-trust authority cracks Nvidia)...


@antarix1 commented on GitHub (Jan 31, 2025):

Excellent insight. Thanks for taking the time and replying in detail.

> There's no el cheapo solution. I keep looking around, like many others; but as of today, AI isn't for everyone (yet). You can pay tokens to the big players, or experiment with open models using commodity hardware thanks to Ollama and the like. That's it.

El cheapo is only a secondary objective. The primary objective is to run CUDA-dependent models on non-Nvidia hardware but still get some benefit from tensor cores, or to run tensor-lite models. AMD is trying to run CUDA code using translation, and it is effectively useless.
Aaaaaaand, I never trust a monopoly, be it in code, product or scientific thinking. It becomes a matter of WHEN, not IF (don't be evil).

> I didn't look at Coral's datasheet, but I doubt it has the memory controller to connect large amounts of RAM, and not of the right kind. Moreover, producing PCBs for RAM chips/slots is ... expensive, as it needs to be perfectly tuned for all those signals packed into a small space; the track routing must be perfect. The card you are suggesting wouldn't be the typical weekend project you can buy on Tindie.

This is indeed true. I have heard from experts how difficult it is to design signal paths for high-speed, low-latency memory. So the price of the final product would be at least similar to a mid-tier used GPU, would be my guess. Then again, the focus is to develop an open-source PCB design that enthusiasts can manufacture themselves using off-the-shelf components.

> Moreover, have a look at Nvidia's 5000 cards: they enabled FP4 and claimed a 115% AI performance improvement. But they just ... monetized the quantization that we used to tune ourselves to help our cheap GPUs. The 100% improvement is just ... halving the information quantum in order to double the performance per cycle, given the same number of transistors as the previous gen; the other 15% is because the 5000 cards have 5% more cores and better RAM. In other words: Nvidia pwned a software improvement that used to be in users' hands. Neither the governments (with their baroque laws) nor the industry (with its marketing) is really helping to democratize AI...

Thank you for bringing this up. They dare to do this because they stand without competition or even a remote alternative. Besides, I have always treated the marketing fluff of percentages as a gimmick. As soon as they begin talking in % points, I stop listening. My concern is for when they lock down these cards so that users can no longer optimize and tinker on their own terms.

> Your best bets currently are Apple's machines (because they share RAM and VRAM), starting at 6000+ coins, Rockchip clusters (about 1500 coins to have one), or ... eBay & pray (that the US anti-trust authority cracks Nvidia)...

No Apple, thank you. Please refer to [Louis Rossmann](https://rossmanngroup.com/). Not even a used MacBook.
eBay is okay, but not reliable.

Please have a look at [this](https://www.makerfabs.com/dual-edge-tpu-adapter-m2-2280-b-m-key.html); it seems people are already working on it. Also, [this](https://www.makerfabs.com/dual-edge-tpu-adapter.html) looks promising as a take-off point for the base design.

Thanks again for your deep thought and consideration. I'd invite others to offer valuable insights into this conversation.


@antarix1 commented on GitHub (Jan 31, 2025):

Please check out https://github.com/magic-blue-smoke/Dual-Edge-TPU-Adapter


@mfp20 commented on GitHub (Feb 2, 2025):

> Excellent insight. Thanks for taking the time and replying in detail.
>
> > There's no el cheapo solution. I keep looking around, like many others; but as of today, AI isn't for everyone (yet). You can pay tokens to the big players, or experiment with open models using commodity hardware thanks to Ollama and the like. That's it.
>
> El cheapo is only a secondary objective. The primary objective is to run CUDA-dependent models on non-Nvidia hardware but still get some benefit from tensor cores, or to run tensor-lite models.

Dude, you didn't read the previous contributions in this thread, so you are missing a point: the neural units and the RAM MUST be tightly coupled (i.e. there must be HIGH memory bandwidth, because the neurons need to iterate multiple times over the whole model sitting in RAM; the higher the bandwidth the better; the universe above the sky is the only limit). That said, if you compare the NPU-memory bandwidth on an Nvidia card with the bandwidth of the PCIe 5 bus, you'll see the huge difference. In other words, there are no buses readily available on our computers that can match the bandwidth available on the GPU card alone. Modern GPU cards are autonomous systems that communicate over the PCIe bus from time to time in order to access the low-speed components (disk, network, user I/O) they need to deliver the job...

If you place the NPU on an x16 PCIe 5 slot, you introduce a bottleneck between the NPU and the RAM. It doesn't matter how many Corals you pack on a single PCIe slot... the more you pack, the more the bottleneck will impact the NPU's performance. I pointed you to Apple's because there is software able to exploit Thunderbolt/USB4 connections in order to focus multiple MacBooks (note: each having up to 96GB of VRAM) on the same AI job (e.g. powered by a 200GB model), but again ... 40 Gbps isn't the 1800 GB/s available to the Nvidia GPU... so the end result will be WAY slower than a single Blackwell-based system. There is already software to work around these issues, but the end result CAN'T match the performance of a proper hardware solution.
In software you can buffer, exploit some common hardware components (e.g. MMUs, DMA units) and so on in order to parallelize the work over multiple cheap GPUs, each having some tens of Gbps of bandwidth available on PCIe or Thunderbolt buses, but it looks like the AI isn't very parallelizable, so you can't have much success. In hardware, instead, there are other limits: you can't make a 4D object in our three-dimensional space, so you can't produce a tesseract (i.e. a geometry having equal distance between all the computation cores and all the memory units); have a look at the NUMA architectures available on the market (e.g. Intel Xeon and AMD Threadripper). And that's the reason why those NVIDIA racks are bloody expensive: they are a full pack of workarounds in order to have computation cores and memory at some-sort-of-equal distances.
Even if you manage to have the $$$ (millions) to buy one of those Nvidia racks, then you need the money to pay the electric bill, and the data to train the models. In other words: unless you are Mark Zuckerberg or whoever else (Microsoft, Google, Oracle, some governments) has both the money and mankind's data, you can't fully take advantage of AI tech.
There are exotic solutions as well: quantum computing, biological computing, and so on. But ... well ... do they work? Do they exist? How much do they cost?

You can use Corals (and other cheap AI solutions; there are better units around) for AI-based pattern-matching jobs (e.g. computer vision). What you cannot do is run those huge generative models we currently run with Ollama. That's why Ollama doesn't support TPUs. I might have been blunt but ... that's not me... it's just the sad part of the AI story.

What we can realistically expect from the Ollama project is that they introduce some form of clustering capability already seen in similar software. That's all these developers can do for us, if they are willing to.


@antarix1 commented on GitHub (Feb 4, 2025):

> > Excellent insight. Thanks for taking the time and replying in detail.
> >
> > > There's no el cheapo solution. I keep looking around, like many others; but as of today, AI isn't for everyone (yet). You can pay tokens to the big players, or experiment with open models using commodity hardware thanks to Ollama and the like. That's it.
> >
> > El cheapo is only a secondary objective. The primary objective is to run CUDA-dependent models on non-Nvidia hardware but still get some benefit from tensor cores, or to run tensor-lite models.
>
> Dude, you didn't read the previous contributions in this thread, so you are missing a point: the neural units and the RAM MUST be tightly coupled (i.e. there must be HIGH memory bandwidth, because the neurons need to iterate multiple times over the whole model sitting in RAM; the higher the bandwidth the better; the universe above the sky is the only limit). That said, if you compare the NPU-memory bandwidth on an Nvidia card with the bandwidth of the PCIe 5 bus, you'll see the huge difference. In other words, there are no buses readily available on our computers that can match the bandwidth available on the GPU card alone. Modern GPU cards are autonomous systems that communicate over the PCIe bus from time to time in order to access the low-speed components (disk, network, user I/O) they need to deliver the job...
>
> If you place the NPU on an x16 PCIe 5 slot, you introduce a bottleneck between the NPU and the RAM. It doesn't matter how many Corals you pack on a single PCIe slot... the more you pack, the more the bottleneck will impact the NPU's performance. I pointed you to Apple's because there is software able to exploit Thunderbolt/USB4 connections in order to focus multiple MacBooks (note: each having up to 96GB of VRAM) on the same AI job (e.g. powered by a 200GB model), but again ... 40 Gbps isn't the 1800 GB/s available to the Nvidia GPU... so the end result will be WAY slower than a single Blackwell-based system. There is already software to work around these issues, but the end result CAN'T match the performance of a proper hardware solution.
> In software you can buffer, exploit some common hardware components (e.g. MMUs, DMA units) and so on in order to parallelize the work over multiple cheap GPUs, each having some tens of Gbps of bandwidth available on PCIe or Thunderbolt buses, but it looks like the AI isn't very parallelizable, so you can't have much success. In hardware, instead, there are other limits: you can't make a 4D object in our three-dimensional space, so you can't produce a tesseract (i.e. a geometry having equal distance between all the computation cores and all the memory units); have a look at the NUMA architectures available on the market (e.g. Intel Xeon and AMD Threadripper). And that's the reason why those NVIDIA racks are bloody expensive: they are a full pack of workarounds in order to have computation cores and memory at some-sort-of-equal distances.
> Even if you manage to have the $$$ (millions) to buy one of those Nvidia racks, then you need the money to pay the electric bill, and the data to train the models. In other words: unless you are Mark Zuckerberg or whoever else (Microsoft, Google, Oracle, some governments) has both the money and mankind's data, you can't fully take advantage of AI tech.
> There are exotic solutions as well: quantum computing, biological computing, and so on. But ... well ... do they work? Do they exist? How much do they cost?
>
> You can use Corals (and other cheap AI solutions; there are better units around) for AI-based pattern-matching jobs (e.g. computer vision). What you cannot do is run those huge generative models we currently run with Ollama. That's why Ollama doesn't support TPUs. I might have been blunt but ... that's not me... it's just the sad part of the AI story.
>
> What we can realistically expect from the Ollama project is that they introduce some form of clustering capability already seen in similar software. That's all these developers can do for us, if they are willing to.

Point taken. Thanks again for the detailed reply.


Reference: github-starred/ollama#482