[GH-ISSUE #14073] New default context lengths will break #71252

Open
opened 2026-05-05 00:56:29 -05:00 by GiteaMirror · 23 comments

Originally created by @tonydiep on GitHub (Feb 4, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14073

The new default context lengths in 0.15.5 will break on my machine with 52 GB VRAM. Ollama becomes unresponsive because it spills onto the CPU until I can get in and set the context to fit within VRAM.

RECOMMENDATIONS

  1. Reduce the jump in num_ctx from 256k to 64k or 96k, or
  2. Have Ollama dynamically increase the default context size while still fitting entirely in VRAM. (I do this manually by increasing num_ctx until GPU=100% and CPU=0%, but why not have Ollama do it? A sketch of that probing loop follows below.)
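
A minimal sketch of that manual probing loop, assuming the standard Ollama HTTP API on localhost:11434 (the /api/generate options field and the /api/ps size/size_vram fields); it illustrates the idea, it is not Ollama's own logic:

```python
#!/usr/bin/env python3
# Hypothetical probe: find the largest num_ctx that still loads 100% on the GPU.
import requests

OLLAMA = "http://localhost:11434"
MODEL = "deepseek-r1:70b"  # example model name

def fits_in_vram(num_ctx: int) -> bool:
    # An empty prompt loads the model with the candidate context size.
    requests.post(f"{OLLAMA}/api/generate",
                  json={"model": MODEL, "prompt": "", "options": {"num_ctx": num_ctx}},
                  timeout=600).raise_for_status()
    # /api/ps reports the model's total size and the portion resident in VRAM.
    for m in requests.get(f"{OLLAMA}/api/ps", timeout=30).json().get("models", []):
        if m["name"].startswith(MODEL):
            return m["size_vram"] >= m["size"]  # 100% GPU, 0% CPU
    return False

best, candidate = 4_096, 8_192
while candidate <= 262_144 and fits_in_vram(candidate):
    best, candidate = candidate, candidate * 2
print(f"largest fully-on-GPU context found: {best}")
```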

According to 0.15.5 (pre-release) notes:

Ollama will now default to the following context lengths based on VRAM:
    < 24 GiB VRAM: 4,096 context
    24-48 GiB VRAM: 32,768 context
    >= 48 GiB VRAM: 262,144 context
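
For reference, the tiering above amounts to a simple lookup on total VRAM; a hypothetical paraphrase in Python (not Ollama's actual code):

```python
def default_num_ctx(vram_gib: float) -> int:
    """Default context length by total VRAM, per the 0.15.5 release notes."""
    if vram_gib < 24:
        return 4_096
    if vram_gib < 48:
        return 32_768
    return 262_144
```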

@jessegross commented on GitHub (Feb 4, 2026):

What model and hardware are you using?


@illusdolphin commented on GitHub (Feb 4, 2026):

Sample: RTX 6000 Pro, 96 GB, trying to run the new 1B glm-ocr model based on the sample from the docs:
PS C:\Users...\Desktop> ollama run glm-ocr Text Recognition: ./image.png
Error: 500 Internal Server Error: model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details

My guess at the reason: ">= 48 GiB VRAM: 262,144 context". Via the API it also throws an error and works only if a meaningful context length is applied via options.
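
For reference, passing the context length "via options" looks roughly like this with the ollama Python client (a hedged sketch; the model name and the 32k value are just examples):

```python
import ollama

# Cap the context explicitly so the runner does not try to allocate the
# 262,144-token default for this request.
response = ollama.generate(
    model="glm-ocr",
    prompt="hello",
    options={"num_ctx": 32768},
)
print(response.response)
```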


@jessegross commented on GitHub (Feb 4, 2026):

glm-ocr should fit easily into that GPU at max context length:

NAME              ID              SIZE     PROCESSOR    CONTEXT    UNTIL              
glm-ocr:latest    6effedd0dc8a    15 GB    100% GPU     131072     4 minutes from now    

Please post the server logs so we can see what is happening.


@tonydiep commented on GitHub (Feb 4, 2026):

I think the point is that a default of 256k context is not a good default.

With 52 GB VRAM, some of my models can go up to 190k and still fit in VRAM. Others, like DeepSeek R1 70B, can only go up to about 10k context before they have to spill to the CPU.

Consider GLM-4.7-flash before the memory fix: using the new 256k default and maxing out its 198k context would have spilled onto the CPU and crashed Ollama by default.

A safer default would be 32k or 64k (or making sure the context still fits in VRAM).


@jessegross commented on GitHub (Feb 4, 2026):

glm-4.7-flash will fit perfectly on your GPU:

NAME                    ID              SIZE     PROCESSOR    CONTEXT    UNTIL              
glm-4.7-flash:latest    d1a8a26252f1    40 GB    100% GPU     202752     4 minutes from now    

Before this change, you would have gotten a 4k context window, which is difficult to use and surprising for most use cases.

Yes, a 43 GB model like deepseek-r1:70b will not leave a lot of room for context length on the GPU. However, it should not crash; it will simply get slower as more runs on the CPU.

There are no perfect defaults for all models and hardware configurations. Dynamically sizing based on available VRAM is also problematic as the model's quality will vary depending on what else is running on the computer.

Over time, we want the context length to be the full length that the model was trained on, with performance impacted only as you actually use more of it. This isn't that yet, but it helps a lot of users with more realistic context lengths while keeping results fairly deterministic.
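
To put rough numbers on the deepseek-r1:70b point above, here is a back-of-the-envelope KV-cache estimate. It assumes a Llama-3-70B-class architecture (80 layers, 8 KV heads, head dimension 128) and an f16 cache; these are assumptions for illustration, not figures taken from Ollama's allocator:

```python
def kv_cache_gib(n_ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Approximate K+V cache size in GiB for a dense GQA model (f16 by default)."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return n_ctx * per_token_bytes / 2**30

print(f"{kv_cache_gib(10_240):.2f} GiB")    # ~3.12 GiB (3200 MiB) at a 10k context
print(f"{kv_cache_gib(262_144):.1f} GiB")   # ~80 GiB at the 262,144 default
```

Under those assumptions, the 262,144-token default alone would want roughly 80 GiB of KV cache on top of the weights, which is consistent with the out-of-memory style errors reported later in this thread.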


@rick-github commented on GitHub (Feb 4, 2026):

The crash the OP experiences is:

ollama  | //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:396: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed

This happens to other models that are pushed to their maximum context, e.g. nemotron-3-nano:30b and ministral-3 (https://github.com/ollama/ollama/issues/13887).
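
For a sense of scale, here is a purely illustrative calculation; the actual tensor that trips the assert depends on the model and compute graph, but a single f16 tensor with a 262,144-long dimension and a 4,096-wide dimension already crosses the 32-bit limit:

```python
INT_MAX = 2**31 - 1              # 2,147,483,647, the limit in the failing GGML_ASSERT
nbytes = 262_144 * 4_096 * 2     # hypothetical f16 tensor: 262,144 x 4,096 elements
print(nbytes, nbytes > INT_MAX)  # 2147483648 True
```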


@tonydiep commented on GitHub (Feb 4, 2026):

> glm-4.7-flash will fit perfectly on your GPU:
>
> NAME                    ID              SIZE     PROCESSOR    CONTEXT    UNTIL
> glm-4.7-flash:latest    d1a8a26252f1    40 GB    100% GPU     202752     4 minutes from now
>
> Before this change, you would have gotten a 4k context window, which is difficult to use and surprising for most use cases.
>
> Yes, a 43G model like deepseek-r1:70b will not leave a lot of room for context length on the GPU. However, it should not crash - simply get slower as more runs on the CPU.
>
> There are no perfect defaults for all models and hardware configurations. Dynamically sizing based on available VRAM is also problematic as the model's quality will vary depending on what else is running on the computer.
>
> Over time, we want the context length to be the full length that the model was trained on and performance to only be impacted as you actually use more of it. This isn't that but it helps a lot of users with more realistic context lengths while keeping results fairly deterministic.

Not really sure why you're telling people that what they're seeing with their own eyes doesn't match what you think they should have seen, or why you're using context-size math from a different build of Ollama.

What I'm reporting is that a default 256k context window means I won't be able to choose Ollama as the inference engine. If the decision is won't-fix or not-a-bug then go ahead and close it and we can cross Ollama off the list.


@jessegross commented on GitHub (Feb 5, 2026):

Please post logs as requested.

Rick has identified the likely cause of @illusdolphin's issue. It's both platform dependent and not related to VRAM, other than the fact that the context length is set based on VRAM.

It's not clear that your issue is the same. The sizes I posted are from the current source of Ollama and they suggest that the issue is not necessarily what you think it is or at least a more narrow set of cases. But it's hard to say without the logs.


@rick-github commented on GitHub (Feb 5, 2026):

A wrinkle here, and possibly the cause of the behaviour that tonydiep is seeing, is that the new tiered defaults don't account for OLLAMA_NUM_PARALLEL.
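
A hypothetical illustration of why that matters, assuming (as in earlier releases) the KV cache is allocated as num_ctx cells per parallel slot; this is an assumption about the scheduler, not confirmed behaviour of 0.15.5:

```python
# If each parallel slot gets its own num_ctx worth of KV cells, the tiered
# 262,144 default multiplies with OLLAMA_NUM_PARALLEL.
default_ctx = 262_144
for num_parallel in (1, 2, 4):
    print(num_parallel, default_ctx * num_parallel)  # 262144, 524288, 1048576 cells
```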


@tonydiep commented on GitHub (Feb 6, 2026):

Models which ran on 0.15.4 do not run on 0.15.5 because of the new default context size of 256k.

Ollama will now default to the following context lengths based on VRAM:
    < 24 GiB VRAM: 4,096 context
    24-48 GiB VRAM: 32,768 context
    >= 48 GiB VRAM: 262,144 context

With the default context size set by 0.15.5, Ollama crashes:

tonydiep@tiny:~/LLMs$ ollama --version
ollama version is 0.15.5
tonydiep@tiny:~/LLMs$ ollama run deepseek-r1:70b
Error: 500 Internal Server Error: model requires more system memory (74.0 GiB) than is available (56.7 GiB)

With the default context size from 0.15.4, or by setting the context size to 10k to fit my VRAM, it works:

tonydiep@tiny:~/LLMs$ ollama run deepseek-r1-70b-custom:latest
>>> hello
Hello! How can I assist you today? 😊

For people evaluating Ollama vs other inference engines, it looks like other inference engines can run models that Ollama cannot.


@tonydiep commented on GitHub (Feb 7, 2026):

llama3.3:70b also stopped working with Ollama 0.15.5:

tonydiep@tiny:~/LLMs$ ollama --version
ollama version is 0.15.5
tonydiep@tiny:~/LLMs$ ollama run llama3.3:70b
Error: 500 Internal Server Error: model requires more system memory (74.0 GiB) than is available (56.7 GiB)


@tonydiep commented on GitHub (Feb 8, 2026):

Code that uses Ollama and worked in 0.15.4 no longer works:

ollama version is 0.15.6

Error calling Ollama: an error was encountered while running the model: CUDA error: the launch timed out and was terminated
current device: 2, in function ggml_backend_cuda_synchronize at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2981
cudaStreamSynchronize(cuda_ctx->stream())
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error (status code: 500)


@rick-github commented on GitHub (Feb 8, 2026):

What model?


@tonydiep commented on GitHub (Feb 9, 2026):

The model is deepseek-r1-70b but customized to have a context size of 10,000 so it fits in VRAM. deepseek-r1-70b with the default 256k context size does not start.

The model with 10k context runs in the Ollama CLI,

tonydiep@tiny:~/Jobs$ ollama --version
ollama version is 0.15.6
tonydiep@tiny:~/Jobs$ ollama run deepseek-r1-70b-custom:latest
>>> How large is your context size right now?
My context size is approximately 131,000 tokens. This means I can process and respond to text inputs of that length. If you have any specific
questions or need assistance with something, feel free to ask!

... but the same model breaks if run via Python with the following error:

Error calling Ollama: an error was encountered while running the model: CUDA error: the launch timed out and was terminated
current device: 2, in function ggml_backend_cuda_synchronize at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2981
cudaStreamSynchronize(cuda_ctx->stream())
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error (status code: 500)

The model worked when used in Python with Ollama 0.15.4


@rick-github commented on GitHub (Feb 9, 2026):

Can you provide server logs and a minimal repro with python code?

#!/usr/bin/env python3

import ollama

model="deepseek-r1-70b-custom"

response = ollama.chat(
	model=model,
	messages=[
		{"role":"user","content":"hello"}
	]
)
print(response.message.content)

$ ollama -v
ollama version is 0.15.6
$ python3 14073.py
Hello! How can I assist you today? 😊
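
As a side note while debugging, a small check like the following can confirm that the custom model actually carries the reduced num_ctx parameter (a hedged sketch; the exact response fields may differ between versions of the ollama Python client):

```python
import ollama

# Show the custom model's Modelfile parameters; a "num_ctx 10000" line should
# appear if the reduced context was baked into the model.
info = ollama.show("deepseek-r1-70b-custom")
print(info.parameters)
```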

@tonydiep commented on GitHub (Feb 9, 2026):

You're right. The minimal hello-world worked and it respected the 10k context length. Thanks!

Feb 08 20:11:45 tiny ollama[9434]: print_info: general.name = DeepSeek R1 Distill Llama 70B
Feb 08 20:11:45 tiny ollama[9434]: print_info: vocab type = BPE
Feb 08 20:11:45 tiny ollama[9434]: print_info: n_vocab = 128256
Feb 08 20:11:45 tiny ollama[9434]: print_info: n_merges = 280147
Feb 08 20:11:45 tiny ollama[9434]: print_info: BOS token = 128000 '<|begin▁of▁sentence|>'
Feb 08 20:11:45 tiny ollama[9434]: print_info: EOS token = 128001 '<|end▁of▁sentence|>'
Feb 08 20:11:45 tiny ollama[9434]: print_info: EOT token = 128009 '<|eot_id|>'
Feb 08 20:11:45 tiny ollama[9434]: print_info: EOM token = 128008 '<|eom_id|>'
Feb 08 20:11:45 tiny ollama[9434]: print_info: PAD token = 128001 '<|end▁of▁sentence|>'
Feb 08 20:11:45 tiny ollama[9434]: print_info: LF token = 198 'Ċ'
Feb 08 20:11:45 tiny ollama[9434]: print_info: EOG token = 128001 '<|end▁of▁sentence|>'
Feb 08 20:11:45 tiny ollama[9434]: print_info: EOG token = 128008 '<|eom_id|>'
Feb 08 20:11:45 tiny ollama[9434]: print_info: EOG token = 128009 '<|eot_id|>'
Feb 08 20:11:45 tiny ollama[9434]: print_info: max token length = 256
Feb 08 20:11:45 tiny ollama[9434]: load_tensors: loading model tensors, this can take a while... (mmap = true)
Feb 08 20:11:45 tiny ollama[9434]: load_tensors: offloading 80 repeating layers to GPU
Feb 08 20:11:45 tiny ollama[9434]: load_tensors: offloading output layer to GPU
Feb 08 20:11:45 tiny ollama[9434]: load_tensors: offloaded 81/81 layers to GPU
Feb 08 20:11:45 tiny ollama[9434]: load_tensors: CPU_Mapped model buffer size = 563.62 MiB
Feb 08 20:11:45 tiny ollama[9434]: load_tensors: CUDA0 model buffer size = 20038.81 MiB
Feb 08 20:11:45 tiny ollama[9434]: load_tensors: CUDA1 model buffer size = 11512.00 MiB
Feb 08 20:11:45 tiny ollama[9434]: load_tensors: CUDA2 model buffer size = 8428.67 MiB
Feb 08 20:11:54 tiny ollama[9434]: llama_context: constructing llama_context
Feb 08 20:11:54 tiny ollama[9434]: llama_context: n_seq_max = 1
Feb 08 20:11:54 tiny ollama[9434]: llama_context: n_ctx = 10240
Feb 08 20:11:54 tiny ollama[9434]: llama_context: n_ctx_seq = 10240
Feb 08 20:11:54 tiny ollama[9434]: llama_context: n_batch = 512
Feb 08 20:11:54 tiny ollama[9434]: llama_context: n_ubatch = 512
Feb 08 20:11:54 tiny ollama[9434]: llama_context: causal_attn = 1
Feb 08 20:11:54 tiny ollama[9434]: llama_context: flash_attn = auto
Feb 08 20:11:54 tiny ollama[9434]: llama_context: kv_unified = false
Feb 08 20:11:54 tiny ollama[9434]: llama_context: freq_base = 500000.0
Feb 08 20:11:54 tiny ollama[9434]: llama_context: freq_scale = 1
Feb 08 20:11:54 tiny ollama[9434]: llama_context: n_ctx_seq (10240) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Feb 08 20:11:54 tiny ollama[9434]: llama_context: CUDA_Host output buffer size = 0.52 MiB
Feb 08 20:11:54 tiny ollama[9434]: llama_kv_cache: CUDA0 KV buffer size = 1640.00 MiB
Feb 08 20:11:54 tiny ollama[9434]: llama_kv_cache: CUDA1 KV buffer size = 960.00 MiB
Feb 08 20:11:54 tiny ollama[9434]: llama_kv_cache: CUDA2 KV buffer size = 600.00 MiB
Feb 08 20:11:54 tiny ollama[9434]: llama_kv_cache: size = 3200.00 MiB ( 10240 cells, 80 layers, 1/1 seqs), K (f16): 1600.00 MiB, V (f16): 1600.00 MiB
Feb 08 20:11:54 tiny ollama[9434]: llama_context: pipeline parallelism enabled (n_copies=4)
Feb 08 20:11:54 tiny ollama[9434]: llama_context: Flash Attention was auto, set to enabled
Feb 08 20:11:55 tiny ollama[9434]: llama_context: CUDA0 compute buffer size = 370.04 MiB
Feb 08 20:11:55 tiny ollama[9434]: llama_context: CUDA1 compute buffer size = 320.04 MiB
Feb 08 20:11:55 tiny ollama[9434]: llama_context: CUDA2 compute buffer size = 370.55 MiB
Feb 08 20:11:55 tiny ollama[9434]: llama_context: CUDA_Host compute buffer size = 96.05 MiB
Feb 08 20:11:55 tiny ollama[9434]: llama_context: graph nodes = 2487
Feb 08 20:11:55 tiny ollama[9434]: llama_context: graph splits = 4
Feb 08 20:11:55 tiny ollama[9434]: time=2026-02-08T20:11:55.076-05:00 level=INFO source=server.go:1388 msg="llama runner started in 14.68 seconds"
Feb 08 20:11:55 tiny ollama[9434]: time=2026-02-08T20:11:55.076-05:00 level=INFO source=sched.go:537 msg="loaded runners" count=1
Feb 08 20:11:55 tiny ollama[9434]: time=2026-02-08T20:11:55.076-05:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Feb 08 20:11:55 tiny ollama[9434]: time=2026-02-08T20:11:55.077-05:00 level=INFO source=server.go:1388 msg="llama runner started in 14.68 seconds"
Feb 08 20:12:02 tiny ollama[9434]: [GIN] 2026/02/08 - 20:12:02 | 200 | 22.659538687s | 127.0.0.1 | POST "/api/chat"
(venv) tonydiep@tiny:~/Projects/test-ollama$


@rick-github commented on GitHub (Feb 9, 2026):

Can you provide the logs from a failure?


@tonydiep commented on GitHub (Feb 9, 2026):

Here's the one where it's loading deepseek-r1-70b but giving it 10k context instead of the default so that it fits in VRAM. It fails.

ollama_crash1_deepseek.txt (https://github.com/user-attachments/files/25189433/ollama_crash1_deepseek.txt)


@tonydiep commented on GitHub (Feb 9, 2026):

Qwen3-coder-next also crashes with the default context size, but that was expected.

ollama_crash2_memory.txt (https://github.com/user-attachments/files/25189737/ollama_crash2_memory.txt)


@tonydiep commented on GitHub (Feb 10, 2026):

If Ollama won't reduce its default context size, can we have a parameter like llama.cpp's -c to specify a context size when running a model?

It's ridiculous that Ollama can't run models that llama.cpp or LocalAI can run with 64 GB system RAM + 64 GB VRAM, without even a way to set a smaller context size at run time.

tonydiep@tiny:~$ nvidia-smi
Tue Feb 10 12:17:36 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.16 Driver Version: 580.126.16 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5060 Ti On | 00000000:01:00.0 Off | N/A |
| 0% 35C P8 9W / 180W | 190MiB / 16311MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:02:00.0 Off | N/A |
| 0% 43C P8 15W / 370W | 19863MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:04:00.0 Off | N/A |
| 0% 37C P8 15W / 350W | 19847MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

tonydiep@tiny:~$ free -h
total used free shared buff/cache available
Mem: 62Gi 9.1Gi 27Gi 100Mi 26Gi 53Gi

tonydiep@tiny:~$ ollama --version
ollama version is 0.15.6
tonydiep@tiny:~$ ollama run deepseek-r1:70b
Error: 500 Internal Server Error: model requires more system memory (70.0 GiB) than is available (53.2 GiB)
tonydiep@tiny:~$ ollama run llama3.3:70b
Error: 500 Internal Server Error: model requires more system memory (70.0 GiB) than is available (53.1 GiB)
tonydiep@tiny:~$ ollama run qwen3-next:80b
Error: 500 Internal Server Error: model requires more system memory (70.7 GiB) than is available (53.1 GiB)


@rick-github commented on GitHub (Feb 10, 2026):

If OLLAMA_CONTEXT_LENGTH is set to 4096 in the server environment then the server will act exactly as it did before the context scaling was added.


@tonydiep commented on GitHub (Feb 10, 2026):

I already had OLLAMA_CONTEXT_LENGTH set to a known size that worked with deepseek-r1-70b, but I'll try 4096:

declare -x OLLAMA_CONTEXT_LENGTH="10000"
declare -x OLLAMA_FLASH_ATTENTION="1"


@NAPTiON commented on GitHub (Feb 23, 2026):

"I can confirm that this issue is related to the increased default context lengths in 0.15.5. To mitigate this, you may want to consider using locally trained models like Llama 3.2 for categorization, or utilizing launchd for scheduling tasks. Additionally, using JSONL format for persistence could help alleviate some of the memory pressure on your system. You can learn more about my approach to solving similar issues in my writeup at magic.naption.ai/pipeline."


Reference: github-starred/ollama#71252