[GH-ISSUE #14116] Tiered context length can exhaust VRAM #55722

Open
opened 2026-04-29 09:38:10 -05:00 by GiteaMirror · 36 comments

Originally created by @rick-github on GitHub (Feb 6, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14116

What is the issue?

A new feature in 0.15.5 is setting the default context length (OLLAMA_CONTEXT_LENGTH) based on the amount of VRAM detected:

  * < 24 GiB VRAM: 4,096 context
  * 24-48 GiB VRAM: 32,768 context
  * >= 48 GiB VRAM: 262,144 context

However, the setting of OLLAMA_NUM_PARALLEL is not taken into account and the total VRAM used is proportional to OLLAMA_CONTEXT_LENGTH * OLLAMA_NUM_PARALLEL. This can exhaust the available VRAM and cause model spilling to system RAM/swap, resulting in poor performance.
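
As a rough illustration of why the two settings multiply, here is a hedged back-of-envelope sketch (formula assumed, fp16 KV cache, ignoring model weights and compute buffers; the layer/head numbers are illustrative):

```sh
# KV cache bytes per token for a GQA model is roughly
#   2 (K and V) * n_layers * n_kv_heads * head_dim * 2 bytes (fp16)
# and the total cache scales with n_ctx * n_parallel.
# Assumed example: 80 layers, 8 KV heads, head_dim 128, n_ctx=32768, n_parallel=4:
echo $(( 2 * 80 * 8 * 128 * 2 * 32768 * 4 / 1024 / 1024 / 1024 )) GiB   # -> 40 GiB
```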

Either the automatically selected context size should take parallelism into account, or the effect of OLLAMA_NUM_PARALLEL on total VRAM requirements should be documented.

Even if OLLAMA_NUM_PARALLEL is not set, the larger context sizes for GPUs with >= 24 GiB VRAM can cause model spilling.

To prevent tiered context length from causing a problem, OLLAMA_CONTEXT_LENGTH can be explicitly set in the server environment.
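
For a systemd install (service name `ollama`, as used by the Linux package; Docker and macOS installs differ), that can look like this:

```sh
# Pin the default context so the VRAM-based tier never applies:
sudo systemctl edit ollama
# add in the override file:
#   [Service]
#   Environment="OLLAMA_CONTEXT_LENGTH=4096"
sudo systemctl restart ollama
```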

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

0.15.5

GiteaMirror added the bug label 2026-04-29 09:38:10 -05:00

@jessegross commented on GitHub (Feb 6, 2026):

The long term goal is to get the default context length to be the model's trained context length, to avoid lower quality and surprises. In that situation, we would want to spill to system RAM and take the performance hit only when you actually use that context, rather than see the lower performance immediately as we would today. In that world, Ollama could also automatically configure num parallel, increasing it to the limit that would stay in VRAM, since that will maximize performance.

Working back from that to today, I don't think that we want to automatically divide the context by num parallel. Overall, the expectation should be that num parallel increases VRAM usage (as it always has). It's an advanced feature and for people who use it, I would recommend also setting the context length.
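
In that spirit, a minimal sketch of pairing the two variables when raising parallelism (values illustrative; size them so context × parallel fits your VRAM):

```sh
# Four parallel slots at 8k each have roughly the KV footprint of one 32k slot:
OLLAMA_NUM_PARALLEL=4 OLLAMA_CONTEXT_LENGTH=8192 ollama serve
```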

@rick-github commented on GitHub (Feb 6, 2026):

Leaving open for visibility.

@winstonma commented on GitHub (Feb 7, 2026):

I am using an AMD iGPU. It has 64 GB of memory (57 GB of it shareable). I am running the Qwen3-Next (80B-A3B) model. Here is the info collected using ollama ps:

| Context | Processor | Memory | tokens/s |
| ------- | --------- | ------ | -------- |
| 16384 | 100% GPU | 51 GB | ~15 |
| 32768 | 100% GPU | 55 GB | ~15 |
| 262144 | 45%/55% CPU/GPU | 108 GB | ~6 |

How can I add a flag so that ollama can dynamically set the context size to fit everything in GPU?

@rick-github commented on GitHub (Feb 7, 2026):

> How can I add a flag so that ollama can dynamically set the context size to fit everything in GPU?

There is no such flag yet. That's the goal as described by Jesse, but at the moment only a static limit can be set with OLLAMA_CONTEXT_LENGTH.

@winstonma commented on GitHub (Feb 7, 2026):

> > How can I add a flag so that ollama can dynamically set the context size to fit everything in GPU?
>
> There is no such flag yet. That's the goal as described by Jesse, but at the moment only a static limit can be set with OLLAMA_CONTEXT_LENGTH.

In the meantime I am setting a fixed OLLAMA_CONTEXT_LENGTH using systemctl. I think Jesse's goal is very valid. But OLLAMA_CONTEXT_LENGTH is a global variable that would affect every model.

I think the dynamic approach is a very good start, but I am not sure that a fixed 4096 context for <24 GB cards helps achieve that goal. So I believe there is a case for making the context size dynamic.

Could a dynamic context size, up to the GPU memory limit, be added as a feature? Thanks

EDIT: I think I am looking for per model custom default setting (e.g. default no_think, context size) without modifying modelfile.
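
One existing avenue for per-request (rather than per-model) control is the API's options.num_ctx; a minimal example (model name illustrative):

```sh
# Override the context for a single request via the REST API:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-next",
  "prompt": "hello",
  "options": { "num_ctx": 16384 }
}'
```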

@boomam commented on GitHub (Feb 12, 2026):

I'm guessing this is the cause of what I'm seeing today.
VRAM usage is at 200 MB of 96 GB total, yet it complains when loading a 42 GB model:

Error: 500 Internal Server Error: model requires more system memory (92.5 GiB) than is available (38.3 GiB)

@rick-github commented on GitHub (Feb 13, 2026):

What model? 92.5G should fit in 95.8G. Server logs may provide more info.

@boomam commented on GitHub (Feb 13, 2026):

> What model? 92.5G should fit in 95.8G. Server logs may provide more info.

Just doing some testing -

  • Hardware
    • AMD Strix Halo 128GB
    • 96GB set in UEFI to VRAM
  • Versions
    • Running v0.16.0-rocm (same happened with previous v0.15 release, too).
    • Docker v29.2.1
    • Ubuntu 25.10
      • Kernel 6.17.0-14-generic
      • AMD Driver 7.2.70200
  • Ollama Notes/Settings
    • ollama run model --verbose - command used to test
    • Context length set to 8192 with OLLAMA_CONTEXT_LENGTH - added as testing, issue still occurs regardless of this.
    • Parallel set to 1 with OLLAMA_NUM_PARALLEL - added as testing, issue still occurs regardless of this.
| Model Name | Model Size | Result |
| ---------- | ---------- | ------ |
| llama3.1:70b | 42 GB | Error: 500 Internal Server Error: model requires more system memory (92.5 GiB) than is available (38.1 GiB) |
| qwen3-coder-next:q4_K_M | 51 GB (61 GB loaded) | runs |
| gpt-oss:120b | 65 GB | runs |
| llama4:16x17b | 67 GB | Error: 500 Internal Server Error: model requires more system memory (254.8 GiB) than is available (33.7 GiB) |
| devstral-2:123b | 74 GB | Error: 500 Internal Server Error: model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details |
| mixtral:8x22b | 79 GB | Error: 500 Internal Server Error: model requires more system memory (48.8 GiB) than is available (38.1 GiB) |
| qwen3-coder-next:Q8_0 | 84 GB | runs |

When they fail, I get repeats of this in the logs -

[ollama-general] 2026-02-13T00:17:32.537530814Z goroutine 66 gp=0xc00012a700 m=nil [IO wait]:
[ollama-general] 2026-02-13T00:17:32.537531545Z runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
[ollama-general] 2026-02-13T00:17:32.537532157Z 	runtime/proc.go:435 +0xce fp=0xc000122dd8 sp=0xc000122db8 pc=0x5f4c70dd6f4e
[ollama-general] 2026-02-13T00:17:32.537532948Z runtime.netpollblock(0x5f4c70dfa7f8?, 0x70d70506?, 0x4c?)
[ollama-general] 2026-02-13T00:17:32.537533569Z 	runtime/netpoll.go:575 +0xf7 fp=0xc000122e10 sp=0xc000122dd8 pc=0x5f4c70d9c0f7
[ollama-general] 2026-02-13T00:17:32.537534351Z internal/poll.runtime_pollWait(0x7de9d7484cc8, 0x72)
[ollama-general] 2026-02-13T00:17:32.537535102Z 	runtime/netpoll.go:351 +0x85 fp=0xc000122e30 sp=0xc000122e10 pc=0x5f4c70dd6165
[ollama-general] 2026-02-13T00:17:32.537535743Z internal/poll.(*pollDesc).wait(0xc000309200?, 0xc0003bf751?, 0x0)
[ollama-general] 2026-02-13T00:17:32.537537346Z 	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000122e58 sp=0xc000122e30 pc=0x5f4c70e5e487
[ollama-general] 2026-02-13T00:17:32.537538048Z internal/poll.(*pollDesc).waitRead(...)
[ollama-general] 2026-02-13T00:17:32.537538669Z 	internal/poll/fd_poll_runtime.go:89
[ollama-general] 2026-02-13T00:17:32.537539310Z internal/poll.(*FD).Read(0xc000309200, {0xc0003bf751, 0x1, 0x1})
[ollama-general] 2026-02-13T00:17:32.537540081Z 	internal/poll/fd_unix.go:165 +0x27a fp=0xc000122ef0 sp=0xc000122e58 pc=0x5f4c70e5f77a
[ollama-general] 2026-02-13T00:17:32.537540733Z net.(*netFD).Read(0xc000309200, {0xc0003bf751?, 0x0?, 0x0?})
[ollama-general] 2026-02-13T00:17:32.537541384Z 	net/fd_posix.go:55 +0x25 fp=0xc000122f38 sp=0xc000122ef0 pc=0x5f4c70ed4cc5
[ollama-general] 2026-02-13T00:17:32.537542266Z net.(*conn).Read(0xc000158748, {0xc0003bf751?, 0x0?, 0x0?})
[ollama-general] 2026-02-13T00:17:32.537542907Z 	net/net.go:194 +0x45 fp=0xc000122f80 sp=0xc000122f38 pc=0x5f4c70ee3085
[ollama-general] 2026-02-13T00:17:32.537543608Z net/http.(*connReader).backgroundRead(0xc0003bf740)
[ollama-general] 2026-02-13T00:17:32.537544219Z 	net/http/server.go:690 +0x37 fp=0xc000122fc8 sp=0xc000122f80 pc=0x5f4c710cfbd7
[ollama-general] 2026-02-13T00:17:32.537546073Z net/http.(*connReader).startBackgroundRead.gowrap2()
[ollama-general] 2026-02-13T00:17:32.537546704Z 	net/http/server.go:686 +0x25 fp=0xc000122fe0 sp=0xc000122fc8 pc=0x5f4c710cfb05
[ollama-general] 2026-02-13T00:17:32.537547365Z runtime.goexit({})
[ollama-general] 2026-02-13T00:17:32.537547936Z 	runtime/asm_amd64.s:1700 +0x1 fp=0xc000122fe8 sp=0xc000122fe0 pc=0x5f4c70ddeec1
[ollama-general] 2026-02-13T00:17:32.537548587Z created by net/http.(*connReader).startBackgroundRead in goroutine 10
[ollama-general] 2026-02-13T00:17:32.537549229Z 	net/http/server.go:686 +0xb6
[ollama-general] 2026-02-13T00:17:32.537549830Z 
[ollama-general] 2026-02-13T00:17:32.537550411Z rax    0x0
[ollama-general] 2026-02-13T00:17:32.537551002Z rbx    0x1aa
[ollama-general] 2026-02-13T00:17:32.537551593Z rcx    0x7dea200acb2c
[ollama-general] 2026-02-13T00:17:32.537552184Z rdx    0x6
[ollama-general] 2026-02-13T00:17:32.537552745Z rdi    0x1a1
[ollama-general] 2026-02-13T00:17:32.537553316Z rsi    0x1aa
[ollama-general] 2026-02-13T00:17:32.537553877Z rbp    0x7de9d53f5b90
[ollama-general] 2026-02-13T00:17:32.537554458Z rsp    0x7de9d53f5b50
[ollama-general] 2026-02-13T00:17:32.537555020Z r8     0x0
[ollama-general] 2026-02-13T00:17:32.537555611Z r9     0x0
[ollama-general] 2026-02-13T00:17:32.537556172Z r10    0x8
[ollama-general] 2026-02-13T00:17:32.537556733Z r11    0x246
[ollama-general] 2026-02-13T00:17:32.537557304Z r12    0x6
[ollama-general] 2026-02-13T00:17:32.537557865Z r13    0x7de976e2325a
[ollama-general] 2026-02-13T00:17:32.537558426Z r14    0x16
[ollama-general] 2026-02-13T00:17:32.537558997Z r15    0x7de9c80011a0
[ollama-general] 2026-02-13T00:17:32.537559588Z rip    0x7dea200acb2c
[ollama-general] 2026-02-13T00:17:32.537560169Z rflags 0x246
[ollama-general] 2026-02-13T00:17:32.537560740Z cs     0x33
[ollama-general] 2026-02-13T00:17:32.537561351Z fs     0x0
[ollama-general] 2026-02-13T00:17:32.537561953Z gs     0x0
[ollama-general] 2026-02-13T00:17:32.582499334Z time=2026-02-13T00:17:32.582Z level=ERROR source=server.go:1205 msg="do load request" error="Post \"http://127.0.0.1:35163/load\": EOF"
[ollama-general] 2026-02-13T00:17:32.582753421Z time=2026-02-13T00:17:32.582Z level=ERROR source=server.go:1205 msg="do load request" error="Post \"http://127.0.0.1:35163/load\": dial tcp 127.0.0.1:35163: connect: connection refused"
[ollama-general] 2026-02-13T00:17:32.582757078Z time=2026-02-13T00:17:32.582Z level=INFO source=sched.go:490 msg="Load failed" model=/root/.ollama/models/blobs/sha256-bd6c22cad19a402cea476d148800ec28704e6de55a34ca5da2d1b924df90945e error="model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details"
[ollama-general] 2026-02-13T00:17:32.583150127Z time=2026-02-13T00:17:32.582Z level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 2"
[ollama-general] 2026-02-13T00:17:32.583192897Z [GIN] 2026/02/13 - 00:17:32 | 500 |  2.070730296s |       127.0.0.1 | POST     "/api/generate"

It's been a while since I used them, but those models that are now giving 500 errors did work before.

@rick-github commented on GitHub (Feb 13, 2026):

Show the bit before the crashdump, preferably from the start. If you've set OLLAMA_CONTEXT_LENGTH it's not the tiered context length.

@boomam commented on GitHub (Feb 13, 2026):

Can you elaborate on what you're asking for there, please?
The log is just full of the output above, repeated over and over, when it occurs.

@rick-github commented on GitHub (Feb 13, 2026):

The log contains information about how the server is configured and what it is doing. It starts with a line containing the string "server config", and will then go on to show device detection and readiness. When you load a model, it will show stats about the model, how layers are being allocated, the progress of the model load, and metadata about prompts sent to the model. If a crash occurs, the server will log information about the cause and then end with the crashdump. This is what I am asking for.

@rick-github commented on GitHub (Feb 13, 2026):

Run this to get the log for the most recent session:

journalctl -u ollama --no-pager --since "$(systemctl show ollama --property=ActiveEnterTimestamp --value)"
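
For a Docker install like the one in this thread, the equivalent is the container's log stream (container name assumed from the compose file shared below):

```sh
# Docker equivalent of the journalctl command above:
docker logs --tail 500 ollama-general
```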

@boomam commented on GitHub (Feb 13, 2026):

Fresh start of the container -

time=2026-02-13T00:34:10.121Z level=INFO source=routes.go:1636 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-02-13T00:34:10.125Z level=INFO source=images.go:473 msg="total blobs: 113"
time=2026-02-13T00:34:10.127Z level=INFO source=images.go:480 msg="total unused blobs removed: 0"
time=2026-02-13T00:34:10.128Z level=INFO source=routes.go:1689 msg="Listening on [::]:11434 (version 0.16.0)"
time=2026-02-13T00:34:10.128Z level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-02-13T00:34:10.128Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 37507"
time=2026-02-13T00:34:10.848Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 43623"
time=2026-02-13T00:34:11.464Z level=INFO source=types.go:42 msg="inference compute" id=0 filter_id=0 library=ROCm compute=gfx1151 name=ROCm0 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=60342.13 pci_id=0000:c2:00.0 type=iGPU total="111.3 GiB" available="111.2 GiB"
time=2026-02-13T00:34:11.464Z level=INFO source=routes.go:1739 msg="vram-based default context" total_vram="111.3 GiB" default_num_ctx=262144

Log after trying one of the models that fails -

[GIN] 2026/02/13 - 00:35:39 | 200 |      64.952µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/02/13 - 00:35:40 | 200 |   104.11925ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/02/13 - 00:35:40 | 200 |  101.180307ms |       127.0.0.1 | POST     "/api/show"
time=2026-02-13T00:35:40.247Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 40435"
time=2026-02-13T00:35:40.857Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /root/.ollama/models/blobs/sha256-de20d2cf2dc430b1717a8b07a9df029d651f3895dbffec4729a3902a6fe344c9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 70B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 70B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 80
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  162 tensors
llama_model_loader: - type q4_K:  441 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   81 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 39.59 GiB (4.82 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 70.55 B
print_info: general.name     = Meta Llama 3.1 70B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2026-02-13T00:35:41.037Z level=WARN source=server.go:169 msg="requested context size too large for model" num_ctx=262144 n_ctx_train=131072
time=2026-02-13T00:35:41.037Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-de20d2cf2dc430b1717a8b07a9df029d651f3895dbffec4729a3902a6fe344c9 --port 40785"
time=2026-02-13T00:35:41.038Z level=INFO source=sched.go:463 msg="system memory" total="30.6 GiB" free="30.4 GiB" free_swap="8.0 GiB"
time=2026-02-13T00:35:41.038Z level=INFO source=sched.go:470 msg="gpu memory" id=0 library=ROCm available="110.7 GiB" free="111.2 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-13T00:35:41.038Z level=INFO source=server.go:498 msg="loading model" "model layers"=81 requested=-1
time=2026-02-13T00:35:41.038Z level=WARN source=server.go:1044 msg="model request too large for system" requested="92.5 GiB" available="38.4 GiB" total="30.6 GiB" free="30.4 GiB" swap="8.0 GiB"
time=2026-02-13T00:35:41.038Z level=INFO source=sched.go:490 msg="Load failed" model=/root/.ollama/models/blobs/sha256-de20d2cf2dc430b1717a8b07a9df029d651f3895dbffec4729a3902a6fe344c9 error="model requires more system memory (92.5 GiB) than is available (38.4 GiB)"
time=2026-02-13T00:35:41.046Z level=INFO source=runner.go:965 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
[GIN] 2026/02/13 - 00:35:41 | 500 |  943.203397ms |       127.0.0.1 | POST     "/api/generate"

@rick-github commented on GitHub (Feb 13, 2026):

OLLAMA_CONTEXT_LENGTH:0

The context length is not set. Where have you configured it?
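
A quick way to check what the server actually received (the startup "server config" line logs every OLLAMA_* variable; a value of 0 means unset; container name assumed from this thread's compose file):

```sh
# Extract the context-length setting from the startup log line:
docker logs ollama-general 2>&1 | grep -o 'OLLAMA_CONTEXT_LENGTH:[^ ]*'
```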

@boomam commented on GitHub (Feb 13, 2026):

In the compose file I use.

@rick-github commented on GitHub (Feb 13, 2026):

If you share the compose file that would be helpful.

@boomam commented on GitHub (Feb 13, 2026):

services:
  ollama-general:
    image: ollama/ollama:0.16.0-rocm
    container_name: ollama-general
    restart: always
    ports:
      - "11434:11434"
    volumes:
      - ollama-general:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_CONTEXT_LENGTH=8192
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=1
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:11434/ || exit 1"]
      interval: 60s
      timeout: 10s
      retries: 5
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    networks:
      - internal

volumes:
  ollama-general:

networks:
  internal:
    name: internal
    external: true
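
One thing worth checking with compose-managed environments: a plain restart reuses the old container, so changed environment: values only apply once the container is recreated, e.g.:

```sh
# Recreate the container so new environment values take effect:
docker compose up -d --force-recreate ollama-general
```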

@boomam commented on GitHub (Feb 13, 2026):

Ah, I think I fixed it. Give me a moment to validate...

@boomam commented on GitHub (Feb 13, 2026):

OK, if I set these variables as such -
- OLLAMA_CONTEXT_LENGTH=8192
- OLLAMA_NUM_PARALLEL=1
then ollama run llama3.1:70b works.
If I remove the context length and leave parallel at 1, it still works.
If I reset parallel back to 4, it fails again.

Not tested the other failed models yet.

@rick-github commented on GitHub (Feb 13, 2026):

OLLAMA_NUM_PARALLEL=4 needs 4 times the context space.

@boomam commented on GitHub (Feb 13, 2026):

Strange, though, that some models, even larger ones with similar context specs, work, but certain models like llama3.1 don't.

@boomam commented on GitHub (Feb 13, 2026):

Regardless though, 'problem' solved - appreciate the fast responses @rick-github. :-)

@jocull commented on GitHub (Feb 17, 2026):

As noted in https://github.com/open-webui/open-webui/issues/21537 I am chasing this down because it produces a significant change on my Mac. Context windows have suddenly jumped by 10x and models now must reload and use significantly more memory.

So is this change here to stay, meaning I must adjust my machine and configuration for it, or is it unintentional and I should wait for it to be reverted/fixed?

Ollama is installed via homebrew as well which makes altering the environment vars more difficult.

@rick-github commented on GitHub (Feb 17, 2026):

Set OLLAMA_CONTEXT_LENGTH=4096 in the server environment and the behaviour will be the same as before the tiered context change.

@jocull commented on GitHub (Feb 17, 2026):

Setting the environment vars with a homebrew-installed ollama is inordinately hard, which appears to be a longstanding issue: https://github.com/orgs/Homebrew/discussions/6196#discussioncomment-14849386
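
For the macOS/launchd case, Ollama's FAQ suggests launchctl setenv; whether that reaches a brew-services daemon depends on which launchd domain starts it, so treat this as a sketch:

```sh
# Set the variable in the launchd user session, then restart the service:
launchctl setenv OLLAMA_CONTEXT_LENGTH 4096
brew services restart ollama
```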

@rick-github commented on GitHub (Feb 18, 2026):

If your installer prevents you from configuring the program it's installing, you can fall back to configuring the model itself.

% ollama show --modelfile qwen3:14b > Modelfile
% echo PARAMETER num_ctx 4096 >> Modelfile
% ollama create qwen3:14b

If you don't like the idea of modifying the model (e.g., the modification will be removed if you re-pull the model), then you can create a copy of the model:

% ollama show --modelfile qwen3:14b > Modelfile
% echo PARAMETER num_ctx 4096 >> Modelfile
% ollama create qwen3:14b-task

and then configure OpenWebUI to use this copy for the task of generating chat titles. Go to Admin Panel > Settings > Interface and set Task Model to the new model.
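
Either way, it's worth verifying that the parameter stuck:

```sh
# ollama show lists the model's parameters, including num_ctx:
ollama show qwen3:14b-task | grep -i num_ctx
```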

@winstonma commented on GitHub (Feb 23, 2026):

This is an extract from the AI Engineering book:

> GPUs usually come with 16 GB, 24 GB, 48 GB, and 80 GB of memory. Therefore, many popular models are those that max out these memory configurations. It’s not a coincidence that many models today have 7 billion or 65 billion parameters

Would runtime-adaptive context lengths (tuned to actual free VRAM instead of a static cap) help avoid OOM on smaller cards while still allowing longer contexts on bigger ones? Curious what the trade-offs look like.

@viba1 commented on GitHub (Mar 9, 2026):

same problem here

@rick-github commented on GitHub (Mar 9, 2026):

To prevent tiered context length from causing a problem, OLLAMA_CONTEXT_LENGTH can be explicitly set in the server environment.

@viba1 commented on GitHub (Mar 9, 2026):

> To prevent tiered context length from causing a problem, OLLAMA_CONTEXT_LENGTH can be explicitly set in the server environment.

Isn't setting OLLAMA_CONTEXT_LENGTH (in .env) to avoid performance degradation on large models (qwen3:32b) likely to limit the context I could use on smaller models (gemma3:12b), while still not being large enough for slightly larger models (qwen3.5:35b)?

@rick-github commented on GitHub (Mar 9, 2026):

Setting OLLAMA_CONTEXT_LENGTH=4096 will restore the context length behaviour to what it was prior to 0.15.5. It just sets the default; clients are free to load models with larger (or smaller) context sizes as required. If qwen3.5:35b suffers performance degradation in 0.17.7 because of the default context size, the same would have been true in 0.15.4 and earlier (apart from the model not actually being supported back then). The new tiered context length behaviour is designed to make it easier to load larger models like qwen3.5:35b.

@viba1 commented on GitHub (Mar 9, 2026):

Could we not imagine a dynamic context size based on available VRAM and the size of the selected model, in order to always obtain the maximum context while remaining 100% GPU?

@rick-github commented on GitHub (Mar 9, 2026):

The MLX runner has dynamic context; the llama.cpp and ollama runners do not. There are some PRs retrofitting semi-dynamic context to the existing runners.

@stubhead commented on GitHub (Mar 24, 2026):

Hello,

I've set Environment=OLLAMA_CONTEXT_LENGTH=4096 in ollama's systemd override.conf. However, when I restart the ollama server, I see the following in the log:

level=INFO source=routes.go:1832 msg="vram-based default context" total_vram="90.0 GiB" default_num_ctx=262144

Doesn't the default_num_ctx=262144 mean ollama is using 256k and not 4k as requested?

I can confirm that I'm seeing OLLAMA_CONTEXT_LENGTH=4096 in the log's informational output as well, about 10 lines earlier, so it looks like ollama is at least taking the override.conf into account.

Why the disparity?

@rick-github commented on GitHub (Mar 24, 2026):

"vram-based default context" indicates the default context if not overridden by OLLAMA_CONTEXT_LENGTH or num_ctx in the API call/Modelfile.

@stubhead commented on GitHub (Mar 25, 2026):

"vram-based default context" indicates the default context if not overridden by OLLAMA_CONTEXT_LENGTH or num_ctx in the API call/Modelfile.

many thanks for the clarification, and the quick reply!


Reference: github-starred/ollama#55722