[GH-ISSUE #2269] Recommended Spec For Dolphin Mixtral on AWS #63342

Closed
opened 2026-05-03 13:03:21 -05:00 by GiteaMirror · 11 comments

Originally created by @alkali333 on GitHub (Jan 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2269

Hi there,

I have been playing around with various models on Amazon EC2 instances, but I'm not very experienced with AWS and I'm not sure what setup is optimal for running Dolphin Mixtral and other LLMs.

Can anybody recommend an instance that will run it relatively smoothly, or just the specification I need? I've been able to get good performance on some setups but I don't know if I am paying too much.

Thanks

@orlyandico commented on GitHub (Feb 3, 2024):

Unless your company is paying for your AWS spend, may I suggest hyperstack.cloud ?

They are WAY cheaper than AWS. They have the RTX A6000 Ada Generation with 48GB of GPU memory for $1.10/hour on demand.

The (generally) best bang-for-the-buck AWS GPU instances are g4dn and g5g, at $0.526/hour on-demand for a single-GPU instance with 16GB of GPU memory. Based on my own benchmarking, the A6000 is more than double the performance of the Nvidia T4 in the g4dn when using Ollama, so although it's 2x the price, you get 2x the performance and 3x the GPU memory.

Hyperstack also has the cheaper A4000 at $0.43/hour, which is cheaper than the T4-based g4dn.xlarge and faster (although how much faster, I have not measured).

Stay far, far away from AWS g2 and g3 instances (super old), or even the P2/P3. They simply don't have the price-performance.

AWS doesn't have a single-GPU A100 instance, only an 8-GPU one at around $20/hour. Also, A100 and H100 GPU availability is very low.
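
For reference, a minimal sketch of the kind of throughput check behind these comparisons; it assumes a local Ollama server on the default port with the model already pulled, and reads the eval_count/eval_duration fields that /api/generate returns:

```python
# Rough Ollama throughput check: generate once and report tokens per second.
# Assumes `ollama serve` is running locally and `dolphin-mixtral` has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "dolphin-mixtral",
        "prompt": "Explain the difference between TCP and UDP in three sentences.",
        "stream": False,
    },
    timeout=600,
)
data = resp.json()
# eval_duration is reported in nanoseconds
tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"generated {data['eval_count']} tokens at {tokens_per_sec:.1f} tokens/s")
```

Run the same prompt on each instance type and compare the tokens/s figure.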

@yvescleuder commented on GitHub (Feb 28, 2024):

Hey @orlyandico,

I have the same question. I want to use the Gemma model, but I have no idea which instance gives the best cost-benefit.
Which one would you recommend? And which provider?

@orlyandico commented on GitHub (Feb 28, 2024):

I can only speak for Hyperstack (as that's what I use personally). The Big Three hyperscalers are more expensive and offer more features, but if you're only going to be running Ollama, they're overkill.

Another option: get an Ethereum mining rig (the mainboard, power supply, and case) and populate it with Tesla P40 GPUs from eBay (around $200 each). They aren't the fastest (a bit faster than an RTX 3060), but they have 24GB of VRAM each. If you can fit three on the mainboard, that's 72GB of VRAM for under $1000.

@yvescleuder commented on GitHub (Feb 28, 2024):

Hi @orlyandico,

I need to host it, as I will use it for my company.
We have a module that we built on top of OpenAI GPT-3.5.

However, we do not use the API in conversation mode, only single questions: every time a customer interacts, it creates a new standalone question.
We wanted to use conversation mode, but with OpenAI that would be very expensive, as they charge for both input and output tokens.
We want to support conversations, and for that I believe a self-hosted model is the best option.

@orlyandico commented on GitHub (Feb 29, 2024):

How large is the model? Will it fit on a single GPU? Most of the smaller hyperscalers offer single-GPU SKUs, which is cheaper, but if your model won't fit... then there is a problem.

Inference time is roughly linear in model size. There are a lot of decent 7B models. But inference is quadratic in the context length, so if your chats get very long, inferences/second will drop. How many users do you expect? That may dictate how many GPUs you need to sustain a given inferences/second.

There is a very nice article here - https://newsletter.pragmaticengineer.com/p/scaling-chatgpt

TL;DR - ChatGPT (and presumably many/all LLMs) is memory-bound, not GPU-compute-bound, which means an A100 is good enough (you may not need an H100). But if your model is too large you will still end up with a multi-GPU configuration. However, Nvidia came up with some optimizations (not sure if Ollama uses them) that give a 2x performance increase on the Ada generation. So you want an Ada GPU (like the RTX A6000 Ada that I referenced above) - https://huggingface.co/blog/optimum-nvidia

You can also try quantizing the model down to 4 bits. Microsoft has some recent research showing good accuracy at 1.58 bits per weight (!): https://arxiv.org/html/2402.17764v1

(probably not available as publicly usable code yet)

If you are going to use a model that fits in a 16GB GPU, then I would look for whoever has the cheapest 16GB VRAM Ada generation GPU around.
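
As a rough sizing aid (back-of-envelope math, not a measurement), you can estimate the VRAM a model needs from its parameter count and bits per weight, plus some headroom for the KV cache and runtime:

```python
# Back-of-envelope VRAM estimate: weights take params * bits / 8 bytes,
# plus headroom for KV cache and runtime overhead. Real usage varies by
# runtime and context length, so treat this as a sizing hint only.
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb + overhead_gb

# Mixtral 8x7B keeps all experts resident, ~47B parameters in total.
print(f"Mixtral 8x7B @ 4-bit: ~{estimate_vram_gb(47, 4):.0f} GB")  # ~26 GB -> won't fit a 16GB card
print(f"7B model @ 4-bit:     ~{estimate_vram_gb(7, 4):.0f} GB")   # ~6 GB  -> fits a 16GB GPU easily
```

That's why the Dolphin Mixtral from the original question needs something in the 48GB class (or multiple GPUs), while a quantized 7B fits comfortably on a 16GB card.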

Finally - why Gemma? It is not the highest-scoring 7B-class model on the HF LLM leaderboard.

@yvescleuder commented on GitHub (Feb 29, 2024):

Alright,

Maybe I was hasty when I mentioned Gemma; I haven't actually chosen which model is best for my scenario, especially because I still need to study to understand which one to choose. I'm just getting into the world of AI. So far I've only used OpenAI, so I don't know much about the models or how they work.
I don't have many simultaneous users, only a few, but they will use it in conversation mode, and the chats could get long.
I don't know exactly which model to use; how can I find out which is best for my scenario?
OpenAI works very well for us, but the cost is quite high if we use conversation mode with GPT-4.

@orlyandico commented on GitHub (Feb 29, 2024):

Well.. OpenAI's GPT-4 is the best model, hands down. None of the open-source ones come close. The one that comes closest (today) is Smaug-70B, which has been added to the Ollama repo. It is a huge model, however; you would probably need 2x A100 to self-host it (or 2x RTX A6000 Ada). The question is - do you need the accuracy of Smaug-70B? I have had good experience with DolphinPhi (which is in the Ollama model gallery), which is a 1.6B model. Pretty much any model with RAG (and a corpus of your data stored in, say, OpenSearch or PostgreSQL) would probably perform acceptably.

Have you looked at this? It uses a smaller local LLM to reduce the token count sent to OpenAI, thus reducing cost - https://github.com/microsoft/LLMLingua/

Basically it removes tokens from the input that it thinks (based on the local 3B or 7B LLM) are not needed. In my experience... it didn't work so great (it prints out what it thinks is the savings on OpenAI API calls). If you are using RAG, the prompts can get long really fast, so something like LLMLingua would help.
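
For reference, this is roughly how LLMLingua is invoked, based on the project's README at the time (the exact API, defaults, and local scoring model may have changed, and the prompt text here is just an illustration):

```python
# Minimal LLMLingua sketch: compress a long RAG context with a small local LLM
# before sending it to a paid API. PromptCompressor loads a local model
# (a Llama-family model by default) to score and drop low-information tokens.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # model_name=... / device_map=... select the local scoring model

long_context = "...retrieved documents go here..."  # placeholder
result = compressor.compress_prompt(
    long_context,
    instruction="Answer the question using only the context.",
    question="What does the regulation say about permit renewals?",
    target_token=300,
)
print(result["compressed_prompt"])  # send this, not the full context, to OpenAI
print(result["origin_tokens"], "->", result["compressed_tokens"])
```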

@yvescleuder commented on GitHub (Feb 29, 2024):

I probably don't need something at that level. I can run experiments within my application and start with small models to understand what I actually need.
What would be your recommendation? And what type of machine do I need?

@orlyandico commented on GitHub (Feb 29, 2024):

Here are some steps for doing RAG and Ollama locally on a Linux box - https://github.com/marklysze/LangChain-RAG-Linux

The key issue is that, to answer the questions, you need an indexed corpus of the documents. Say it's a bunch of local government laws, guidelines, regulations, etc.; the LLM does not know anything about these and would hallucinate. To avoid hallucinations and to "ground" the answer in the existing document corpus, you need RAG (and you need a database to hold the document corpus). The link above steps through this.
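
Not from that repo, but as a rough sketch of the same loop (assuming a local Ollama server with an embedding model such as nomic-embed-text and a small chat model already pulled; model names and document text are illustrative), RAG boils down to embed, retrieve, then generate with the retrieved text in the prompt:

```python
# Tiny in-memory RAG sketch against a local Ollama server: embed a few
# documents, pick the closest one to the question by cosine similarity,
# and ground the answer in it. A real setup would use a vector database
# (Chroma, OpenSearch, pgvector) instead of a Python list.
import requests
import numpy as np

OLLAMA = "http://localhost:11434"

def embed(text: str) -> np.ndarray:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return np.array(r.json()["embedding"])

docs = [
    "Permits must be renewed every two years before March 31.",
    "Noise ordinances apply between 22:00 and 07:00 on weekdays.",
]
doc_vecs = [embed(d) for d in docs]

question = "How often do permits need to be renewed?"
q = embed(question)
best = max(range(len(docs)),
           key=lambda i: float(q @ doc_vecs[i]) /
                         (np.linalg.norm(q) * np.linalg.norm(doc_vecs[i])))

prompt = f"Answer using only this context:\n{docs[best]}\n\nQuestion: {question}"
answer = requests.post(f"{OLLAMA}/api/generate",
                       json={"model": "dolphin-phi", "prompt": prompt, "stream": False})
print(answer.json()["response"])
```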

If you use something like DolphinPhi 1.6B as the LLM, then pretty much any GPU will work. I personally use an RTX 3060 (non-Ti), which has 12GB of VRAM. It can handle 7B models at decent inference rates (certainly enough for your prototyping). So a PC with Linux, an RTX 3060 or 4060, and 64GB of RAM should be plenty.

@bmizerany commented on GitHub (Mar 11, 2024):

This is a great question, and I hope you found your answer! I'm closing this only because it doesn't fall into the category of an "issue".

For general questions/help/support please join us in Discord or Reddit:

* https://discord.com/invite/ollama
* https://www.reddit.com/r/ollama

@gnumoksha commented on GitHub (Mar 18, 2024):

I've successfully run Ollama (llama2) on a g5.xlarge instance running Ubuntu 22.04. The CUDA library didn't work on Amazon Linux 2023.
