[GH-ISSUE #14116] Tiered context length can exhaust VRAM #55722

Open
opened 2026-04-29 09:38:10 -05:00 by GiteaMirror · 36 comments

Originally created by @rick-github on GitHub (Feb 6, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14116

What is the issue?

A new feature in 0.15.5 is setting the default context length (OLLAMA_CONTEXT_LENGTH) based on the amount of VRAM detected:

  * < 24 GiB VRAM: 4,096 context
  * 24-48 GiB VRAM: 32,768 context
  * >= 48 GiB VRAM: 262,144 context

However, the setting of OLLAMA_NUM_PARALLEL is not taken into account and the total VRAM used is proportional to OLLAMA_CONTEXT_LENGTH * OLLAMA_NUM_PARALLEL. This can exhaust the available VRAM and cause model spilling to system RAM/swap, resulting in poor performance.
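
As a rough illustration of why the two settings multiply, here is a hedged back-of-envelope sketch (formula assumed, fp16 KV cache, ignoring model weights and compute buffers; the layer/head numbers are illustrative):

```sh
# KV cache bytes per token for a GQA model is roughly
#   2 (K and V) * n_layers * n_kv_heads * head_dim * 2 bytes (fp16)
# and the total cache scales with n_ctx * n_parallel.
# Assumed example: 80 layers, 8 KV heads, head_dim 128, n_ctx=32768, n_parallel=4:
echo $(( 2 * 80 * 8 * 128 * 2 * 32768 * 4 / 1024 / 1024 / 1024 )) GiB   # -> 40 GiB
```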

Either the automatically selected context size should take parallelism into account, or the effect of OLLAMA_NUM_PARALLEL on total VRAM requirements should be documented.

Even if OLLAMA_NUM_PARALLEL is not set, the larger context sizes for GPUs with >= 24 GiB VRAM can cause model spilling.

To prevent tiered context length from causing a problem, OLLAMA_CONTEXT_LENGTH can be explicitly set in the server environment.
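
For a systemd install (service name `ollama`, as used by the Linux package; Docker and macOS installs differ), that can look like this:

```sh
# Pin the default context so the VRAM-based tier never applies:
sudo systemctl edit ollama
# add in the override file:
#   [Service]
#   Environment="OLLAMA_CONTEXT_LENGTH=4096"
sudo systemctl restart ollama
```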

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

0.15.5

GiteaMirror added the bug label 2026-04-29 09:38:10 -05:00

@jessegross commented on GitHub (Feb 6, 2026):

The long term goal is to get the default context length to be the model's trained context length, to avoid lower quality and surprises. In that situation, we would want to spill to system RAM and take the performance hit only when you actually use that context, rather than see the lower performance immediately as we would today. In that world, Ollama could also automatically configure num parallel, increasing it to the limit that would stay in VRAM, since that will maximize performance.

Working back from that to today, I don't think that we want to automatically divide the context by num parallel. Overall, the expectation should be that num parallel increases VRAM usage (as it always has). It's an advanced feature and for people who use it, I would recommend also setting the context length.
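
In that spirit, a minimal sketch of pairing the two variables when raising parallelism (values illustrative; size them so context × parallel fits your VRAM):

```sh
# Four parallel slots at 8k each have roughly the KV footprint of one 32k slot:
OLLAMA_NUM_PARALLEL=4 OLLAMA_CONTEXT_LENGTH=8192 ollama serve
```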

@rick-github commented on GitHub (Feb 6, 2026):

Leaving open for visibility.

@winstonma commented on GitHub (Feb 7, 2026):

I am using an AMD iGPU. It has 64 GB of memory (57 GB of it shareable). I am running the Qwen3-Next (80B-A3B) model. Here is the info collected using ollama ps:

| Context | Processor | Memory | tokens/s |
| ------- | --------- | ------ | -------- |
| 16384 | 100% GPU | 51 GB | ~15 |
| 32768 | 100% GPU | 55 GB | ~15 |
| 262144 | 45%/55% CPU/GPU | 108 GB | ~6 |

How can I add a flag so that ollama can dynamically set the context size to fit everything in GPU?

@rick-github commented on GitHub (Feb 7, 2026):

> How can I add a flag so that ollama can dynamically set the context size to fit everything in GPU?

There is no such flag yet. That's the goal as described by Jesse, but at the moment only a static limit can be set with OLLAMA_CONTEXT_LENGTH.

@winstonma commented on GitHub (Feb 7, 2026):

> > How can I add a flag so that ollama can dynamically set the context size to fit everything in GPU?
>
> There is no such flag yet. That's the goal as described by Jesse, but at the moment only a static limit can be set with OLLAMA_CONTEXT_LENGTH.

In the meantime I am setting a fixed OLLAMA_CONTEXT_LENGTH using systemctl. I think Jesse's goal is very valid. But OLLAMA_CONTEXT_LENGTH is a global variable that would affect every model.

I think the dynamic approach is a very good start, but I am not sure that a fixed 4096 context for <24 GB cards helps achieve that goal. So I believe there is a case for making the context size dynamic.

Could a dynamic context size, up to the GPU memory limit, be added as a feature? Thanks

EDIT: I think I am looking for per model custom default setting (e.g. default no_think, context size) without modifying modelfile.
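
One existing avenue for per-request (rather than per-model) control is the API's options.num_ctx; a minimal example (model name illustrative):

```sh
# Override the context for a single request via the REST API:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-next",
  "prompt": "hello",
  "options": { "num_ctx": 16384 }
}'
```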

@boomam commented on GitHub (Feb 12, 2026):

I'm guessing this is the cause of what I'm seeing today.
VRAM usage is at 200 MB of 96 GB total, yet it complains when loading a 42 GB model:

Error: 500 Internal Server Error: model requires more system memory (92.5 GiB) than is available (38.3 GiB)

@rick-github commented on GitHub (Feb 13, 2026):

What model? 92.5G should fit in 95.8G. Server logs may provide more info.

@boomam commented on GitHub (Feb 13, 2026):

> What model? 92.5G should fit in 95.8G. Server logs may provide more info.

Just doing some testing -

  • Hardware
    • AMD Strix Halo 128GB
    • 96GB set in UEFI to VRAM
  • Versions
    • Running v0.16.0-rocm (same happened with previous v0.15 release, too).
    • Docker v29.2.1
    • Ubuntu 25.10
      • Kernel 6.17.0-14-generic
      • AMD Driver 7.2.70200
  • Ollama Notes/Settings
    • ollama run model --verbose - command used to test
    • Context length set to 8192 with OLLAMA_CONTEXT_LENGTH - added as testing, issue still occurs regardless of this.
    • Parallel set to 1 with OLLAMA_NUM_PARALLEL - added as testing, issue still occurs regardless of this.
| Model Name | Model Size | Result |
| ---------- | ---------- | ------ |
| llama3.1:70b | 42 GB | Error: 500 Internal Server Error: model requires more system memory (92.5 GiB) than is available (38.1 GiB) |
| qwen3-coder-next:q4_K_M | 51 GB (61 GB loaded) | runs |
| gpt-oss:120b | 65 GB | runs |
| llama4:16x17b | 67 GB | Error: 500 Internal Server Error: model requires more system memory (254.8 GiB) than is available (33.7 GiB) |
| devstral-2:123b | 74 GB | Error: 500 Internal Server Error: model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details |
| mixtral:8x22b | 79 GB | Error: 500 Internal Server Error: model requires more system memory (48.8 GiB) than is available (38.1 GiB) |
| qwen3-coder-next:Q8_0 | 84 GB | runs |

When they fail, I get repeats of this in the logs -

[ollama-general] 2026-02-13T00:17:32.537530814Z goroutine 66 gp=0xc00012a700 m=nil [IO wait]:
[ollama-general] 2026-02-13T00:17:32.537531545Z runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
[ollama-general] 2026-02-13T00:17:32.537532157Z 	runtime/proc.go:435 +0xce fp=0xc000122dd8 sp=0xc000122db8 pc=0x5f4c70dd6f4e
[ollama-general] 2026-02-13T00:17:32.537532948Z runtime.netpollblock(0x5f4c70dfa7f8?, 0x70d70506?, 0x4c?)
[ollama-general] 2026-02-13T00:17:32.537533569Z 	runtime/netpoll.go:575 +0xf7 fp=0xc000122e10 sp=0xc000122dd8 pc=0x5f4c70d9c0f7
[ollama-general] 2026-02-13T00:17:32.537534351Z internal/poll.runtime_pollWait(0x7de9d7484cc8, 0x72)
[ollama-general] 2026-02-13T00:17:32.537535102Z 	runtime/netpoll.go:351 +0x85 fp=0xc000122e30 sp=0xc000122e10 pc=0x5f4c70dd6165
[ollama-general] 2026-02-13T00:17:32.537535743Z internal/poll.(*pollDesc).wait(0xc000309200?, 0xc0003bf751?, 0x0)
[ollama-general] 2026-02-13T00:17:32.537537346Z 	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000122e58 sp=0xc000122e30 pc=0x5f4c70e5e487
[ollama-general] 2026-02-13T00:17:32.537538048Z internal/poll.(*pollDesc).waitRead(...)
[ollama-general] 2026-02-13T00:17:32.537538669Z 	internal/poll/fd_poll_runtime.go:89
[ollama-general] 2026-02-13T00:17:32.537539310Z internal/poll.(*FD).Read(0xc000309200, {0xc0003bf751, 0x1, 0x1})
[ollama-general] 2026-02-13T00:17:32.537540081Z 	internal/poll/fd_unix.go:165 +0x27a fp=0xc000122ef0 sp=0xc000122e58 pc=0x5f4c70e5f77a
[ollama-general] 2026-02-13T00:17:32.537540733Z net.(*netFD).Read(0xc000309200, {0xc0003bf751?, 0x0?, 0x0?})
[ollama-general] 2026-02-13T00:17:32.537541384Z 	net/fd_posix.go:55 +0x25 fp=0xc000122f38 sp=0xc000122ef0 pc=0x5f4c70ed4cc5
[ollama-general] 2026-02-13T00:17:32.537542266Z net.(*conn).Read(0xc000158748, {0xc0003bf751?, 0x0?, 0x0?})
[ollama-general] 2026-02-13T00:17:32.537542907Z 	net/net.go:194 +0x45 fp=0xc000122f80 sp=0xc000122f38 pc=0x5f4c70ee3085
[ollama-general] 2026-02-13T00:17:32.537543608Z net/http.(*connReader).backgroundRead(0xc0003bf740)
[ollama-general] 2026-02-13T00:17:32.537544219Z 	net/http/server.go:690 +0x37 fp=0xc000122fc8 sp=0xc000122f80 pc=0x5f4c710cfbd7
[ollama-general] 2026-02-13T00:17:32.537546073Z net/http.(*connReader).startBackgroundRead.gowrap2()
[ollama-general] 2026-02-13T00:17:32.537546704Z 	net/http/server.go:686 +0x25 fp=0xc000122fe0 sp=0xc000122fc8 pc=0x5f4c710cfb05
[ollama-general] 2026-02-13T00:17:32.537547365Z runtime.goexit({})
[ollama-general] 2026-02-13T00:17:32.537547936Z 	runtime/asm_amd64.s:1700 +0x1 fp=0xc000122fe8 sp=0xc000122fe0 pc=0x5f4c70ddeec1
[ollama-general] 2026-02-13T00:17:32.537548587Z created by net/http.(*connReader).startBackgroundRead in goroutine 10
[ollama-general] 2026-02-13T00:17:32.537549229Z 	net/http/server.go:686 +0xb6
[ollama-general] 2026-02-13T00:17:32.537549830Z 
[ollama-general] 2026-02-13T00:17:32.537550411Z rax    0x0
[ollama-general] 2026-02-13T00:17:32.537551002Z rbx    0x1aa
[ollama-general] 2026-02-13T00:17:32.537551593Z rcx    0x7dea200acb2c
[ollama-general] 2026-02-13T00:17:32.537552184Z rdx    0x6
[ollama-general] 2026-02-13T00:17:32.537552745Z rdi    0x1a1
[ollama-general] 2026-02-13T00:17:32.537553316Z rsi    0x1aa
[ollama-general] 2026-02-13T00:17:32.537553877Z rbp    0x7de9d53f5b90
[ollama-general] 2026-02-13T00:17:32.537554458Z rsp    0x7de9d53f5b50
[ollama-general] 2026-02-13T00:17:32.537555020Z r8     0x0
[ollama-general] 2026-02-13T00:17:32.537555611Z r9     0x0
[ollama-general] 2026-02-13T00:17:32.537556172Z r10    0x8
[ollama-general] 2026-02-13T00:17:32.537556733Z r11    0x246
[ollama-general] 2026-02-13T00:17:32.537557304Z r12    0x6
[ollama-general] 2026-02-13T00:17:32.537557865Z r13    0x7de976e2325a
[ollama-general] 2026-02-13T00:17:32.537558426Z r14    0x16
[ollama-general] 2026-02-13T00:17:32.537558997Z r15    0x7de9c80011a0
[ollama-general] 2026-02-13T00:17:32.537559588Z rip    0x7dea200acb2c
[ollama-general] 2026-02-13T00:17:32.537560169Z rflags 0x246
[ollama-general] 2026-02-13T00:17:32.537560740Z cs     0x33
[ollama-general] 2026-02-13T00:17:32.537561351Z fs     0x0
[ollama-general] 2026-02-13T00:17:32.537561953Z gs     0x0
[ollama-general] 2026-02-13T00:17:32.582499334Z time=2026-02-13T00:17:32.582Z level=ERROR source=server.go:1205 msg="do load request" error="Post \"http://127.0.0.1:35163/load\": EOF"
[ollama-general] 2026-02-13T00:17:32.582753421Z time=2026-02-13T00:17:32.582Z level=ERROR source=server.go:1205 msg="do load request" error="Post \"http://127.0.0.1:35163/load\": dial tcp 127.0.0.1:35163: connect: connection refused"
[ollama-general] 2026-02-13T00:17:32.582757078Z time=2026-02-13T00:17:32.582Z level=INFO source=sched.go:490 msg="Load failed" model=/root/.ollama/models/blobs/sha256-bd6c22cad19a402cea476d148800ec28704e6de55a34ca5da2d1b924df90945e error="model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details"
[ollama-general] 2026-02-13T00:17:32.583150127Z time=2026-02-13T00:17:32.582Z level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 2"
[ollama-general] 2026-02-13T00:17:32.583192897Z [GIN] 2026/02/13 - 00:17:32 | 500 |  2.070730296s |       127.0.0.1 | POST     "/api/generate"

It's been a while since I used them, but those models that are now giving 500 errors did work before.

@rick-github commented on GitHub (Feb 13, 2026):

Show the bit before the crashdump, preferably from the start. If you've set OLLAMA_CONTEXT_LENGTH it's not the tiered context length.

@boomam commented on GitHub (Feb 13, 2026):

Can you elaborate on what you're asking for there, please?
The log is just full of the output above, repeated over and over, when it occurs.

@rick-github commented on GitHub (Feb 13, 2026):

The log contains information about how the server is configured and what it is doing. It starts with a line containing the string "server config", and will then go on to show device detection and readiness. When you load a model, it will show stats about the model, how layers are being allocated, the progress of the model load, and metadata about prompts sent to the model. If a crash occurs, the server will log information about the cause and then end with the crashdump. This is what I am asking for.

@rick-github commented on GitHub (Feb 13, 2026):

Run this to get the log for the most recent session:

journalctl -u ollama --no-pager --since "$(systemctl show ollama --property=ActiveEnterTimestamp --value)"
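
For a Docker install like the one in this thread, the equivalent is the container's log stream (container name assumed from the compose file shared below):

```sh
# Docker equivalent of the journalctl command above:
docker logs --tail 500 ollama-general
```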

@boomam commented on GitHub (Feb 13, 2026):

Fresh start of the container -

time=2026-02-13T00:34:10.121Z level=INFO source=routes.go:1636 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-02-13T00:34:10.125Z level=INFO source=images.go:473 msg="total blobs: 113"
time=2026-02-13T00:34:10.127Z level=INFO source=images.go:480 msg="total unused blobs removed: 0"
time=2026-02-13T00:34:10.128Z level=INFO source=routes.go:1689 msg="Listening on [::]:11434 (version 0.16.0)"
time=2026-02-13T00:34:10.128Z level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-02-13T00:34:10.128Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 37507"
time=2026-02-13T00:34:10.848Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 43623"
time=2026-02-13T00:34:11.464Z level=INFO source=types.go:42 msg="inference compute" id=0 filter_id=0 library=ROCm compute=gfx1151 name=ROCm0 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=60342.13 pci_id=0000:c2:00.0 type=iGPU total="111.3 GiB" available="111.2 GiB"
time=2026-02-13T00:34:11.464Z level=INFO source=routes.go:1739 msg="vram-based default context" total_vram="111.3 GiB" default_num_ctx=262144

Log after trying one of the models that fails -

[GIN] 2026/02/13 - 00:35:39 | 200 |      64.952µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/02/13 - 00:35:40 | 200 |   104.11925ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/02/13 - 00:35:40 | 200 |  101.180307ms |       127.0.0.1 | POST     "/api/show"
time=2026-02-13T00:35:40.247Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 40435"
time=2026-02-13T00:35:40.857Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /root/.ollama/models/blobs/sha256-de20d2cf2dc430b1717a8b07a9df029d651f3895dbffec4729a3902a6fe344c9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 70B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 70B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 80
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  162 tensors
llama_model_loader: - type q4_K:  441 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   81 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 39.59 GiB (4.82 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 70.55 B
print_info: general.name     = Meta Llama 3.1 70B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2026-02-13T00:35:41.037Z level=WARN source=server.go:169 msg="requested context size too large for model" num_ctx=262144 n_ctx_train=131072
time=2026-02-13T00:35:41.037Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-de20d2cf2dc430b1717a8b07a9df029d651f3895dbffec4729a3902a6fe344c9 --port 40785"
time=2026-02-13T00:35:41.038Z level=INFO source=sched.go:463 msg="system memory" total="30.6 GiB" free="30.4 GiB" free_swap="8.0 GiB"
time=2026-02-13T00:35:41.038Z level=INFO source=sched.go:470 msg="gpu memory" id=0 library=ROCm available="110.7 GiB" free="111.2 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-13T00:35:41.038Z level=INFO source=server.go:498 msg="loading model" "model layers"=81 requested=-1
time=2026-02-13T00:35:41.038Z level=WARN source=server.go:1044 msg="model request too large for system" requested="92.5 GiB" available="38.4 GiB" total="30.6 GiB" free="30.4 GiB" swap="8.0 GiB"
time=2026-02-13T00:35:41.038Z level=INFO source=sched.go:490 msg="Load failed" model=/root/.ollama/models/blobs/sha256-de20d2cf2dc430b1717a8b07a9df029d651f3895dbffec4729a3902a6fe344c9 error="model requires more system memory (92.5 GiB) than is available (38.4 GiB)"
time=2026-02-13T00:35:41.046Z level=INFO source=runner.go:965 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
[GIN] 2026/02/13 - 00:35:41 | 500 |  943.203397ms |       127.0.0.1 | POST     "/api/generate"

@rick-github commented on GitHub (Feb 13, 2026):

OLLAMA_CONTEXT_LENGTH:0

The context length is not set. Where have you configured it?
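
A quick way to check what the server actually received (the startup "server config" line logs every OLLAMA_* variable; a value of 0 means unset; container name assumed from this thread's compose file):

```sh
# Extract the context-length setting from the startup log line:
docker logs ollama-general 2>&1 | grep -o 'OLLAMA_CONTEXT_LENGTH:[^ ]*'
```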

@boomam commented on GitHub (Feb 13, 2026):

In the compose file I use.

@rick-github commented on GitHub (Feb 13, 2026):

If you share the compose file that would be helpful.

@boomam commented on GitHub (Feb 13, 2026):

services:
  ollama-general:
    image: ollama/ollama:0.16.0-rocm
    container_name: ollama-general
    restart: always
    ports:
      - "11434:11434"
    volumes:
      - ollama-general:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_CONTEXT_LENGTH=8192
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=1
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:11434/ || exit 1"]
      interval: 60s
      timeout: 10s
      retries: 5
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    networks:
      - internal

volumes:
  ollama-general:

networks:
  internal:
    name: internal
    external: true
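
One thing worth checking with compose-managed environments: a plain restart reuses the old container, so changed environment: values only apply once the container is recreated, e.g.:

```sh
# Recreate the container so new environment values take effect:
docker compose up -d --force-recreate ollama-general
```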

@boomam commented on GitHub (Feb 13, 2026):

Ah, I think I fixed it. Give me a moment to validate...

@boomam commented on GitHub (Feb 13, 2026):

OK, if I set these variables as such -
- OLLAMA_CONTEXT_LENGTH=8192
- OLLAMA_NUM_PARALLEL=1
then ollama run llama3.1:70b works.
If I remove the context length and leave parallel at 1, it still works.
If I reset parallel back to 4, it fails again.

Not tested the other failed models yet.

@rick-github commented on GitHub (Feb 13, 2026):

OLLAMA_NUM_PARALLEL=4 needs 4 times the context space.

@boomam commented on GitHub (Feb 13, 2026):

Strange, though, that some models, even larger ones with similar context specs, work, but certain models like llama3.1 don't.

@boomam commented on GitHub (Feb 13, 2026):

Regardless though, 'problem' solved - appreciate the fast responses @rick-github. :-)

@jocull commented on GitHub (Feb 17, 2026):

As noted in https://github.com/open-webui/open-webui/issues/21537 I am chasing this down because it produces a significant change on my Mac. Context windows have suddenly jumped by 10x and models now must reload and use significantly more memory.

So is this change here to stay, meaning I must adjust my machine and configuration for it, or is it unintentional and I should wait for it to be reverted/fixed?

Ollama is installed via homebrew as well which makes altering the environment vars more difficult.

@rick-github commented on GitHub (Feb 17, 2026):

Set OLLAMA_CONTEXT_LENGTH=4096 in the server environment and the behaviour will be the same as before the tiered context change.

@jocull commented on GitHub (Feb 17, 2026):

Setting the environment vars with a homebrew-installed ollama is inordinately hard, which appears to be a longstanding issue: https://github.com/orgs/Homebrew/discussions/6196#discussioncomment-14849386
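
For the macOS/launchd case, Ollama's FAQ suggests launchctl setenv; whether that reaches a brew-services daemon depends on which launchd domain starts it, so treat this as a sketch:

```sh
# Set the variable in the launchd user session, then restart the service:
launchctl setenv OLLAMA_CONTEXT_LENGTH 4096
brew services restart ollama
```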

@rick-github commented on GitHub (Feb 18, 2026):

If your installer prevents you from configuring the program it's installing, you can fall back to configuring the model itself.

% ollama show --modelfile qwen3:14b > Modelfile
% echo PARAMETER num_ctx 4096 >> Modelfile
% ollama create qwen3:14b

If you don't like the idea of modifying the model (e.g., the modification will be removed if you re-pull the model), then you can create a copy of the model:

% ollama show --modelfile qwen3:14b > Modelfile
% echo PARAMETER num_ctx 4096 >> Modelfile
% ollama create qwen3:14b-task

and then configure OpenWebUI to use this copy for the task of generating chat titles. Go to Admin Panel > Settings > Interface and set Task Model to the new model.
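
Either way, it's worth verifying that the parameter stuck:

```sh
# ollama show lists the model's parameters, including num_ctx:
ollama show qwen3:14b-task | grep -i num_ctx
```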

@winstonma commented on GitHub (Feb 23, 2026):

This is an extract from the AI Engineering book:

> GPUs usually come with 16 GB, 24 GB, 48 GB, and 80 GB of memory. Therefore, many popular models are those that max out these memory configurations. It’s not a coincidence that many models today have 7 billion or 65 billion parameters

Would runtime-adaptive context lengths (tuned to actual free VRAM instead of a static cap) help avoid OOM on smaller cards while still allowing longer contexts on bigger ones? Curious what the trade-offs look like.

@viba1 commented on GitHub (Mar 9, 2026):

same problem here

@rick-github commented on GitHub (Mar 9, 2026):

To prevent tiered context length from causing a problem, OLLAMA_CONTEXT_LENGTH can be explicitly set in the server environment.

@viba1 commented on GitHub (Mar 9, 2026):

> To prevent tiered context length from causing a problem, OLLAMA_CONTEXT_LENGTH can be explicitly set in the server environment.

Isn't setting OLLAMA_CONTEXT_LENGTH (in .env) to avoid performance degradation on large models (qwen3:32b) likely to limit the context I could use on smaller models (gemma3:12b), while still not being large enough for slightly larger models (qwen3.5:35b)?

@rick-github commented on GitHub (Mar 9, 2026):

Setting OLLAMA_CONTEXT_LENGTH=4096 will restore the context length behaviour to what it was prior to 0.15.5. It just sets the default; clients are free to load models with larger (or smaller) context sizes as required. If qwen3.5:35b suffers performance degradation in 0.17.7 because of the default context size, the same would have been true in 0.15.4 and earlier (apart from the model not actually being supported back then). The new tiered context length behaviour is designed to make it easier to load larger models like qwen3.5:35b.

@viba1 commented on GitHub (Mar 9, 2026):

Could we not imagine a dynamic context size based on available VRAM and the size of the selected model, in order to always obtain the maximum context while remaining 100% GPU?

@rick-github commented on GitHub (Mar 9, 2026):

The MLX runner has dynamic context; the llama.cpp and ollama runners do not. There are some PRs retrofitting semi-dynamic context to the existing runners.

@stubhead commented on GitHub (Mar 24, 2026):

Hello,

I've set Environment=OLLAMA_CONTEXT_LENGTH=4096 in ollama's systemd override.conf. However, when I restart the ollama server, I see the following in the log:

level=INFO source=routes.go:1832 msg="vram-based default context" total_vram="90.0 GiB" default_num_ctx=262144

Doesn't the default_num_ctx=262144 mean ollama is using 256k and not 4k as requested?

I can confirm that I'm seeing OLLAMA_CONTEXT_LENGTH=4096 in the log's informational output as well, about 10 lines earlier, so it looks like ollama is at least taking the override.conf into account.

Why the disparity?

@rick-github commented on GitHub (Mar 24, 2026):

"vram-based default context" indicates the default context if not overridden by OLLAMA_CONTEXT_LENGTH or num_ctx in the API call/Modelfile.

@stubhead commented on GitHub (Mar 25, 2026):

"vram-based default context" indicates the default context if not overridden by OLLAMA_CONTEXT_LENGTH or num_ctx in the API call/Modelfile.

many thanks for the clarification, and the quick reply!


Reference: github-starred/ollama#55722