[GH-ISSUE #13791] ollama Vulkan crash (780M) #9035

Open
opened 2026-04-12 21:51:53 -05:00 by GiteaMirror · 4 comments

Originally created by @soozs1 on GitHub (Jan 20, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13791

What is the issue?

[ollama_logs.log](https://github.com/user-attachments/files/24734885/ollama_logs.log)

Title: Vulkan backend crash (Exception 0xc0000005) during context shift on AMD Radeon 780M iGPU (Windows)

Environment
Ollama Version: 0.14.2

Operating System: Windows

Processor/GPU: AMD 7840HS with Radeon 780M iGPU (Architecture: gfx1103)

Memory allocated to iGPU in BIOS: 16 GB (UMA Frame Buffer Size)

Backend: Vulkan (experimental, enabled via OLLAMA_VULKAN=1)

Overrides: HSA_OVERRIDE_GFX_VERSION=10.3.0 (Note: logs show a warning about this)

Steps to Reproduce
Set environment variable: OLLAMA_VULKAN=1.

Start ollama serve (with OLLAMA_DEBUG=1 for verbose logs).

Load a large model, e.g., Qwen3-8B (GGUF, Q8_0).

Send a prompt long enough to fill the context window (4096 tokens).

The model begins generation, hits the context limit ("context limit hit - shifting"), and then the Vulkan backend crashes with an access violation.
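
For convenience, here are the steps above condensed into a PowerShell sketch. The library tag qwen3:8b-q8_0 stands in for the locally imported Qwen3-8B Q8_0 GGUF, and long_prompt.txt is a placeholder for any prompt long enough to fill the 4096-token window:

```powershell
# Terminal 1: enable the experimental Vulkan backend and verbose logging
$env:OLLAMA_VULKAN = "1"
$env:OLLAMA_DEBUG  = "1"
ollama serve

# Terminal 2: load the model and pipe in a long prompt
# (long_prompt.txt is a placeholder for any sufficiently long input)
Get-Content .\long_prompt.txt -Raw | ollama run qwen3:8b-q8_0
```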

Observed Behavior & Logs
The model loads successfully, and all 37 layers are offloaded to the GPU ("offloaded 37/37 layers to GPU"). The server starts normally. However, during the inference process, when the context window is full and the system attempts to shift it, a critical error occurs:

```text
time=2026-01-20T13:07:41.404+03:00 level=DEBUG source=cache.go:286 msg="context limit hit - shifting" id=0 limit=4096 input=4096 keep=4 discard=2046
...
Exception 0xc0000005 0x1 0x8 0x7ffad67ca728
PC=0x7ffad67ca728
signal arrived during external code execution

time=2026-01-20T13:07:51.097+03:00 level=ERROR source=server.go:1592 msg="post predict" error="Post \"http://127.0.0.1:7317/completion\": read tcp 127.0.0.1:7322->127.0.0.1:7317: wsarecv: An existing connection was forcibly closed by the remote host."
```
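
Aside: the numbers in the shift line are internally consistent. With input=4096 and keep=4, discard=2046 is exactly (4096 - 4) / 2, which suggests the runner keeps the first 4 tokens and drops half of the remaining context when shifting.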
Key Observations:

Crash Point: The crash (Exception 0xc0000005 - access violation) consistently occurs after the log line "context limit hit - shifting". The Vulkan runner process dies, forcing the main server connection to close.

Model Specific: Smaller models (e.g., phi3:mini) work without issue. The problem appears specific to larger models like the 8B-parameter Qwen.

Memory: The GPU has sufficient memory (16.2 GiB total, ~13 GiB free during load). The model's total memory requirement is reported as ~8.8 GiB.

Flash Attention: The logs show "enabling flash attention" for this model run.

Full server logs from startup to crash are attached above (ollama_logs.log).

Expected Behavior
The model should successfully perform the context shift operation and continue generating the response without crashing.

Additional Context / Attempted Workarounds
The standard ROCm backend is not supported for this GPU architecture (gfx1103), throwing an invalid device function error. Vulkan is the only available GPU acceleration path.
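
For completeness, the failed ROCm attempt looks roughly like this; the override value is the one noted under Environment, and on gfx1103 it still ends in the invalid device function error:

```powershell
# Attempted ROCm path; on gfx1103 this fails with "invalid device function"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"
ollama serve
```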

The issue is reproducible. Multiple attempts yield the same crash at the same point.

Related Issues: This might be connected to memory management problems discussed in issues like #12913 and #13677, but this specific crash during context shift on Windows with an AMD RDNA3 iGPU is a new combination.

Relevant log output


OS

Windows

GPU

AMD

CPU

AMD

Ollama version

0.14.2

GiteaMirror added the bug label 2026-04-12 21:51:53 -05:00

@rick-github commented on GitHub (Jan 20, 2026):

Does it also happen if you use the model from the ollama library ([qwen3:8b-q8_0](https://ollama.com/library/qwen3:8b-q8_0))?


@cjangaritas commented on GitHub (Jan 20, 2026):

I was able to reproduce the error with the ollama model qwen3:0.6b by using a big prompt (1000 lines).
I tested on an AMD laptop with Vulkan enabled; the app crashes with the error reported above.
With the Vulkan env var disabled, ollama eventually returns an answer without crashing.
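
A sketch of that repro in PowerShell, for anyone else trying it (the prompt content is arbitrary filler; any roughly 1000-line input should do):

```powershell
$env:OLLAMA_VULKAN = "1"
# Build an arbitrary 1000-line prompt and pipe it to the model
$prompt = (1..1000 | ForEach-Object { "line $_ of filler text" }) -join "`n"
$prompt | ollama run qwen3:0.6b
```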


@Nyx1197 commented on GitHub (Jan 22, 2026):

I have the same issue.
For now, I've set the environment variable OLLAMA_CONTEXT_LENGTH=16384 to avoid the `context limit hit - shifting` path, although this increases VRAM usage. No similar issues were observed with CUDA or with older ROCm versions (0.12.3). Since I'm using the MI50 (GFX906), I'm unable to use newer versions of ROCm.
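
Spelled out, that workaround is just the following (value as given above; note the higher VRAM cost):

```powershell
# Raise the context window so the shift path is never triggered
$env:OLLAMA_CONTEXT_LENGTH = "16384"
ollama serve
```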


@soozs1 commented on GitHub (Jan 22, 2026):

Yes, the issue is reproducible. The reason I cannot use the ROCm backend is that my iGPU has a locked/static dedicated memory allocation (VRAM) of only 512 MB, which is insufficient for loading model layers. The failure occurs even though the system's unified memory architecture can, in principle, dynamically expand the memory available for graphics; in this case, that expansion does not seem to be utilized or accessible by the ROCm backend.
