Mirror of https://github.com/ollama/ollama.git (synced 2026-05-08 17:49:24 -05:00)

Open · opened 2026-04-12 10:09:39 -05:00 by GiteaMirror · 25 comments
Originally created by @coolrazor007 on GitHub (Nov 3, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/990
Originally assigned to: @dhiltgen on GitHub.
Would love to see Ollama run on a TPU, not just a GPU. Has this been done by anyone already?
@krenax commented on GitHub (Nov 6, 2023):
This may be interesting. Is Ollama officially supported on the TPU?
@coolrazor007 commented on GitHub (Nov 6, 2023):
I did some digging and realized that Ollama is based on llama.cpp, which does not currently support TPUs.
@boredcoder411 commented on GitHub (Jan 8, 2024):
Any updates on this? I really want to use my Edge TPU and Raspberry Pi with this project.
@easp commented on GitHub (Jan 9, 2024):
@boredcoder411 Edge TPUs are not suited for LLMs. They only have, what, 2GB of RAM and slow flash memory.
@boredcoder411 commented on GitHub (Jan 9, 2024):
So what CAN I run on them?
@helium729 commented on GitHub (Apr 28, 2024):
Maybe our team would like to help with this feature; it's just that we don't know where to get started yet.
@GameTec-live commented on GitHub (May 15, 2024):
I'd love support for the PCIe Coral TPUs... You should be able to swap data in and out of memory over the PCIe bus fast enough (at least that's what I've read somewhere), and with the Pi 5 now having NVMe support, I'd love to be able to build a tiny little LLM server...
https://coral.ai/products/m2-accelerator-dual-edgetpu/
@boredcoder411 commented on GitHub (May 15, 2024):
Fairly sure JAX/Flax supports the Coral TPUs, including the USB accelerator. Also fairly sure it has enough RAM to hold at least TinyLlama.
@quadcom commented on GitHub (May 24, 2024):
Why couldn't a RAM disk be created to hold model files in the case of a TPU? I have an Unraid server with 96 GB of RAM; reserving 12-24 GB for a RAM disk wouldn't be a huge hit to its performance.
@easp commented on GitHub (May 24, 2024):
A RAM disk, besides being an obsolete throwback, isn't going to help with the fact that Coral TPUs don't have enough RAM and only have a slow USB connection to their host system.
In addition, they aren't all that fast. They aren't supported by Ollama, and they aren't likely to be, because anyone capable of doing the work likely has better things to do, and even if they did the work, it's unlikely that the Ollama maintainers would merge it, since it would add complexity for very little benefit.
@spc789 commented on GitHub (Jun 10, 2024):
I agree with the RAM disk idea: with an NVMe/PCIe TPU like the Coral PCIe TPUs (NOT the USB version) or the Hailo TPUs, the accelerator is tied to the PCIe bus.
Making a RAM disk, which is a way to forcibly keep things in memory, can be an option to speed things up.
Personally I use the RAM disk strategy with large pandas DataFrames and /dev/shm on Linux when I need interprocess communication on such data.
RAM is directly tied to the memory bus, so this strategy could be a big benefit for TPUs that rely on streaming the data from host memory (the ones that don't have much RAM on board).
@boredcoder411 commented on GitHub (Aug 11, 2024):
RAM disks sound like a great idea, but how would this work in Go?
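For reference, a minimal sketch of the /dev/shm idea in Go could look like the following. The paths are hypothetical and this is not ollama's actual model-loading code; it just stages a blob onto tmpfs and memory-maps it so later reads come from RAM.

```go
// Sketch: stage a model blob onto tmpfs (/dev/shm) and memory-map it so that
// subsequent reads are served from RAM instead of disk. Paths are illustrative.
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

func main() {
	src := "/usr/share/ollama/.ollama/models/blobs/sha256-example" // hypothetical blob path
	dst := "/dev/shm/model.gguf"                                   // tmpfs: contents live in RAM

	in, err := os.Open(src)
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(out, in); err != nil {
		panic(err)
	}
	if err := out.Close(); err != nil {
		panic(err)
	}

	// Memory-map the staged copy; the kernel serves pages straight from RAM.
	f, err := os.Open(dst)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	fmt.Printf("mapped %d bytes from tmpfs\n", len(data))
}
```

Note that this only helps load times and page-cache behaviour; it does nothing about the accelerator-side memory and bus-bandwidth limits discussed above.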
@jasonsmithio commented on GitHub (Aug 27, 2024):
I am happy to help if anyone is tackling this already!
@mfp20 commented on GitHub (Nov 15, 2024):
Does it work? I'm evaluating whether to buy one of those Hailo M.2 cards, as I don't need more GPUs...
@boredcoder411 commented on GitHub (Nov 15, 2024):
Probably not, since Google doesn't like consumers and has closed issues on the JAX repo requesting TPU support.
@easp commented on GitHub (Nov 15, 2024):
@mfp20 That accelerator isn't designed for LLMs. Look at the models they use in their benchmarks: EfficientNetV2-M is a 54M-parameter model. That's 50x smaller than even a small LLM. They don't have the onboard memory, and shifting the weights over a PCIe x2 link for each token isn't any more realistic than it is for a GPU.
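To put rough numbers on that size gap, here is a small back-of-envelope sketch; the figures are approximate and assume int8 weights for the vision model and 4-bit weights for the LLMs:

```go
// Approximate weight sizes: a typical Edge-TPU-class vision model versus small LLMs.
package main

import "fmt"

func main() {
	models := []struct {
		name   string
		params float64 // number of parameters
		bits   float64 // bits per weight (assumed quantization)
	}{
		{"EfficientNetV2-M (vision)", 54e6, 8},
		{"3B LLM", 3e9, 4},
		{"7B LLM", 7e9, 4},
	}
	for _, m := range models {
		gb := m.params * m.bits / 8 / 1e9
		fmt.Printf("%-28s ~%5.2f GB of weights\n", m.name, gb)
	}
}
```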
@mfp20 commented on GitHub (Nov 16, 2024):
Any link about this? If the PCIe latencies (to access the RAM) are too high, there's no point keeping the issue open, as it would not be possible to use the TPUs. Other accelerators have failed in the past (e.g. crypto accelerators on PCI slots).
I've never looked into the computational details; I suppose you are right. But PCIe 5 latencies might be low enough to allow effective acceleration. Maybe not much, but even 1.5x would enable some use cases.
TPUs aside, I wonder if some sort of RAM caching mechanism might do the trick: something like using RAM to hold the whole model and a background process shifting parts of it into VRAM before the GPU needs to process that part. I have no idea how much parallelism those algorithms need, or how predictable the needed chunks of the model are, but any parallel computation can be broken into multiple steps. It's slower, but it would allow decent timings on RAM-rich systems with consumer GPUs that have little VRAM. At the end of the day we would probably see performance similar to https://github.com/b4rtaz/distributed-llama .
@easp commented on GitHub (Nov 16, 2024):
@mfp20 For monolithic models (i.e. not mixture-of-experts) it's very easy to predict which chunks of the model will be needed, because the entire model is read sequentially from start to finish for each token.
For token generation, compute isn't really the issue and PCIe latency isn't the issue; bandwidth is. For that reason, if a portion of the model is in RAM, it's generally faster to compute on the CPU than to ship the data over PCIe in order to use it on a GPU or TPU.
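That bandwidth argument can be made concrete with a tiny back-of-envelope calculation: for a dense model, every generated token has to read essentially all of the weights, so tokens per second is capped by link bandwidth divided by model size. The bandwidth figures below are rough, illustrative numbers, not measurements:

```go
// Upper bound on tokens/s for a dense model: bandwidth / bytes read per token.
package main

import "fmt"

func main() {
	modelGB := 4.0 // e.g. a ~7B model quantized to roughly 4 bits per weight

	links := []struct {
		name string
		gbps float64 // sustained bandwidth in GB/s (approximate)
	}{
		{"USB 3.0 (Coral USB accelerator)", 0.5},
		{"PCIe 4.0 x16", 32},
		{"dual-channel DDR5 (CPU path)", 64},
		{"consumer GPU VRAM", 800},
	}

	for _, l := range links {
		fmt.Printf("%-32s ~%6.1f tokens/s ceiling\n", l.name, l.gbps/modelGB)
	}
}
```

Whichever path has to touch the weights sets the ceiling, which is why keeping them in system RAM and computing on the CPU usually beats streaming them over PCIe to an accelerator.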
@mfp20 commented on GitHub (Nov 17, 2024):
OK, this can be handled in software, in a similar way to how some layers are run on CPUs and some on GPUs today.
I imagine a couple of DMA pumps moving model chunks from RAM to VRAM, synchronized with the main evaluation process. If a model is 56 GB and the VRAM is 8 GB, that's only seven chunks to move for each token. The end result would be execution slower than GPU-plus-VRAM (because of insufficient memory bandwidth) but faster than CPU-only. Hey, better than no execution at all (for lack of enough VRAM), or the unusably slow execution of a bare CPU with the system completely clogged (because the general-purpose cores are executing the model). It would be an enabling solution for consumer-grade systems.
If bandwidth is the limit, then a PCIe 5 TPU could still accelerate: both DDR5 and PCIe 5 offer about 64 GB/s of bandwidth. That's an order of magnitude less than consumer GPUs' VRAM (200-800 GB/s), but CPU-based performance already improves when llama.cpp workers can rely on wide vector instructions (e.g. AVX, AVX2, AVX-512). An ASIC like the ones in TPUs, instead of somewhat general-purpose instructions, might increase the acceleration further, and it would offload the main CPU.
The problem is that none of those TPU cards are PCIe 5 x16, and an x16 card might end up being as expensive as a second-hand GPU on eBay...
In any case, improving heterogeneous computing by implementing the RAM-to-VRAM buffering described above might be useful; probably not much for the single-prompt use case, but for parallel operations. I haven't looked at the current code (in llama.cpp, ollama, LM Studio, and so on), but it looks like they are struggling to mix multiple kinds of silicon.
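As a sanity check on the chunked-streaming idea, here is a rough best-case model assuming perfect double buffering (copy chunk i+1 over PCIe while computing on chunk i). All numbers are hypothetical, the accelerator's own memory bandwidth in particular:

```go
// Best-case per-token time when streaming a model through limited VRAM:
// with full overlap, it is max(total PCIe transfer time, total compute time).
package main

import "fmt"

func main() {
	const (
		modelGB  = 56.0  // hypothetical dense model
		vramGB   = 8.0   // chunk size = one VRAM-sized buffer
		pcieGBps = 64.0  // PCIe 5.0 x16, approximate
		accGBps  = 400.0 // placeholder accelerator memory bandwidth
	)

	chunks := modelGB / vramGB
	transferSec := modelGB / pcieGBps // every weight crosses the bus each token
	computeSec := modelGB / accGBps   // the accelerator still reads every weight

	perToken := transferSec
	if computeSec > perToken {
		perToken = computeSec
	}

	fmt.Printf("chunks per token: %.0f\n", chunks)
	fmt.Printf("per-token time:   %.2f s (~%.1f tokens/s)\n", perToken, 1/perToken)
}
```

Even in this best case the PCIe transfer dominates and the result is on the order of one token per second, which is why a similarly fast system-RAM path with CPU compute tends to be the simpler option.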
@antarix1 commented on GitHub (Jan 31, 2025):
I am a novice user. I have only used ollama on my Linux host and loved it, albeit on ancient hardware without a GPU.
The little that I understand from the above discussion is that although the tensor processors are useful for compute, their severely limited RAM and bus speeds handicap their usability for fast chatbots.
I found this item and wondered if it would be any good.
Also, this may be foolish talk, but do you think it is possible to build a PCIe card with four of these chips and sockets for SODIMM DDR4 RAM, so that users can install as much RAM as they want? Please enlighten me on the intricacies and difficulties of making such a card.
@mfp20 commented on GitHub (Jan 31, 2025):
There's no el cheapo solution. I keep looking around, like many others, but as of today AI isn't for everyone (yet). You can pay for tokens from the big players, or experiment with open models on commodity hardware thanks to Ollama and the like. That's it.
I haven't looked at Coral's datasheet, but I doubt it has a memory controller able to connect large amounts of RAM, and not of the right kind. Moreover, producing PCBs for RAM chips/slots is expensive, as they need to be perfectly tuned for all those signals packed into a small space; the trace routing must be perfect. The card you are suggesting wouldn't be the typical weekend project you can buy on Tindie.
Also, have a look at NVIDIA's 5000-series cards: they enabled FP4 and claimed a 115% AI performance improvement. But they just monetized the quantization we used to apply ourselves to help our cheap GPUs. The 100% improvement is simply halving the size of each value to double the throughput per cycle with the same number of transistors as the previous generation; the other 15% is because the 5000 cards have 5% more cores and better RAM. In other words, NVIDIA captured a software improvement that used to be in users' hands. Neither governments (with their baroque laws) nor the industry (with its marketing) is really helping to democratize AI...
Your best bets currently are Apple machines (because they share RAM and VRAM), starting at 6000+ coins; Rockchip clusters (about 1500 coins for one); or eBay & pray (that the US anti-trust authority cracks NVIDIA)...
@antarix1 commented on GitHub (Jan 31, 2025):
Excellent insight. Thanks for taking the time to reply in detail.
El cheapo is only a secondary objective. The primary objective is to run CUDA-dependent models on non-NVIDIA hardware while still getting some benefit from tensor cores, or to run tensor-lite models. AMD is trying to run CUDA code via translation, and it is effectively useless.
Aaaaaaand, I never trust a monopoly, be it in code, product, or scientific thinking. It becomes a matter of WHEN, not IF (don't be evil).
This is indeed true. I have heard from experts how difficult it is to design signal paths for high-speed, low-latency memory. So the price of the final product would be at least similar to a mid-tier used GPU, would be my guess. Then again, the focus is to develop an open-source PCB design that enthusiasts can manufacture themselves using off-the-shelf components.
Thank you for bringing this up. They dare to do this because they stand without competition or even a remote alternative. Besides, I have always treated marketing fluff about percentages as a gimmick; as soon as they begin talking in percentage points, I stop listening. My concern is for when they lock down these cards so that users can no longer optimize and tinker on their own terms.
No Apple, thank you. Please refer to Louis Rossmann. Not even a used MacBook.
eBay is okay, but not reliable.
Please have a look at this; it seems people are already working on it. Also, this looks promising as a take-off point for the base design.
Thanks again for your deep thought and consideration. I'd invite others to offer valuable insights into this conversation.
@antarix1 commented on GitHub (Jan 31, 2025):
Please check out https://github.com/magic-blue-smoke/Dual-Edge-TPU-Adapter
@mfp20 commented on GitHub (Feb 2, 2025):
Dude, you didn't read the previous contributions in this thread, so you are missing the point: the neural units and the RAM MUST be tightly coupled (i.e. there must be HIGH memory bandwidth, because the compute units need to iterate over the whole model sitting in RAM again and again; the higher the bandwidth the better; the sky is the only limit). That said, if you compare the compute-to-memory bandwidth on an NVIDIA card with the bandwidth of the PCIe 5 bus, you'll see the huge difference. In other words, there are no buses readily available in our computers that can match the bandwidth available on the GPU card alone. Modern GPU cards are autonomous systems that communicate over the PCIe bus from time to time in order to access the low-speed components (disk, network, user I/O) they need to deliver the job...
If you place the NPU on an x16 PCIe 5 slot, you introduce a bottleneck between the NPU and the RAM. It doesn't matter how many Corals you pack onto a single PCIe slot; the more you pack, the more that bottleneck will hurt NPU performance. I pointed you to Apple because there is software able to exploit Thunderbolt/USB4 connections to focus multiple MacBooks (each with up to 96 GB of unified memory) on the same AI job (e.g. a 200 GB model), but again, 40 Gb/s is not the ~1,800 GB/s available to an NVIDIA GPU's VRAM, so the end result will be WAY slower than a single Blackwell-based system. There is already software to work around these issues, but the end result CAN'T match the performance of a proper hardware solution.
In software you can buffer and exploit common hardware components (e.g. MMUs, DMA engines) to parallelize the work over multiple cheap GPUs, each with some tens of Gb/s of bandwidth available on PCIe or Thunderbolt buses, but it looks like this kind of AI doesn't parallelize well, so you can't get far. In hardware there are other limits: you can't make a 4D object in our three-dimensional space, so you can't produce a tesseract (i.e. a geometry with equal distance between all the compute cores and all the memory units); have a look at the NUMA architectures on the market (e.g. Intel Xeon and AMD Threadripper). That's why those NVIDIA racks are so expensive: they are a pile of workarounds to keep compute cores and memory at some sort of equal distance.
Even if you have the money (millions) to buy one of those NVIDIA racks, you still need money to pay the electric bill and data to train the models. In other words, unless you are Mark Zuckerberg or whoever else (Microsoft, Google, Oracle, some governments) has both the money and humanity's data, you can't fully take advantage of AI tech.
There are exotic solutions too: quantum computing, biological computing, and so on. But... well... do they work? Do they exist? How much do they cost?
You can use Corals (and other cheap AI accelerators; there are better units around) for AI-based pattern-matching jobs (e.g. computer vision). What you cannot do is run the huge generative models we currently run with Ollama. That's why Ollama doesn't support TPUs. I might have been blunt, but that's not me; it's just the sad part of the AI story.
What we can realistically expect from the Ollama project is that they introduce some form of the clustering capabilities already seen in similar software. That's all these developers can do for us, if they are willing to.
@antarix1 commented on GitHub (Feb 4, 2025):
Point taken. Thanks again for the detailed reply.