[GH-ISSUE #10234] Mistral Small 3.1 - Sometimes crashes Ollama during image chats #53228

Closed
opened 2026-04-29 02:24:24 -05:00 by GiteaMirror · 8 comments

Originally created by @Notbici on GitHub (Apr 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10234

What is the issue?

We're demoing Ollama as a team, and after we noticed some odd performance hits I reviewed the logs and occasionally found a CUDA error mentioning "SCALE".

Using Ollama model: https://ollama.com/library/mistral-small3.1:24b-instruct-2503-q8_0

We're unsure how to reproduce this; it only happens from time to time, and only when we're attaching base64 images to our API calls to Ollama.
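
For context, a minimal sketch of the kind of request we're making (this assumes the standard /api/chat payload shape; the prompt, image filename, and use of the `requests` library are placeholders for illustration, not our exact client code):

```python
import base64
import requests  # any HTTP client works; requests is only used for this sketch

OLLAMA_URL = "http://127.0.0.1:11434"  # in our setup this is the internal IP from OLLAMA_HOST

# Base64-encode the image before attaching it to the chat message
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "mistral-small3.1:24b-instruct-2503-q8_0",
    "messages": [
        {
            "role": "user",
            "content": "Describe this image.",
            "images": [image_b64],  # base64-encoded images attached to the message
        }
    ],
    "stream": False,
}

resp = requests.post(f"{OLLAMA_URL}/api/chat", json=payload, timeout=120)
print(resp.json()["message"]["content"])
```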

Additional Details:
| NVIDIA-SMI 570.86.16 Driver Version: 570.86.16 CUDA Version: 12.8 |

GPUs:

  • 4x RTX 4090

Additional overrides for Ollama:
[Service]
Environment="OLLAMA_HOST=our internal IP"
Environment="OLLAMA_MAX_QUEUE=5"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_FLASH_ATTENTION=1"

Thanks!

Relevant log output

Apr 11 11:14:23 workstation3 ollama[1649]: [GIN] 2025/04/11 - 11:14:23 | 200 | 16.160088411s |  <internal ip> | POST     "/api/chat"
Apr 11 11:14:25 workstation3 ollama[1649]: [GIN] 2025/04/11 - 11:14:25 | 200 | 18.199299891s |  <internal ip> | POST     "/api/chat"
Apr 11 11:14:26 workstation3 ollama[1649]: [GIN] 2025/04/11 - 11:14:26 | 200 | 19.944917796s |  <internal ip> | POST     "/api/chat"
Apr 11 11:14:27 workstation3 ollama[1649]: ggml_cuda_compute_forward: SCALE failed
Apr 11 11:14:27 workstation3 ollama[1649]: CUDA error: invalid configuration argument
Apr 11 11:14:27 workstation3 ollama[1649]:   current device: 3, in function ggml_cuda_compute_forward at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2315
Apr 11 11:14:27 workstation3 ollama[1649]:   err
Apr 11 11:14:27 workstation3 ollama[1649]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:73: CUDA error
Apr 11 11:14:27 workstation3 ollama[1649]: SIGSEGV: segmentation violation
Apr 11 11:14:27 workstation3 ollama[1649]: PC=0x7f2e0f40ac97 m=52 sigcode=1 addr=0x214803ee4
Apr 11 11:14:27 workstation3 ollama[1649]: signal arrived during cgo execution
Apr 11 11:14:27 workstation3 ollama[1649]: goroutine 11 gp=0xc000103dc0 m=52 mp=0xc002406008 [syscall]:
Apr 11 11:14:27 workstation3 ollama[1649]: runtime.cgocall(0x55e14a14b180, 0xc0000bb7d0)
Apr 11 11:14:27 workstation3 ollama[1649]:         runtime/cgocall.go:167 +0x4b fp=0xc0000bb7a8 sp=0xc0000bb770 pc=0x55e149313aab
Apr 11 11:14:27 workstation3 ollama[1649]: github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x7f2ea8004870, 0x7f2700332fa0)
Apr 11 11:14:27 workstation3 ollama[1649]:         _cgo_gotypes.go:486 +0x4a fp=0xc0000bb7d0 sp=0xc0000bb7a8 pc=0x55e1497106ca
Apr 11 11:14:27 workstation3 ollama[1649]: github.com/ollama/ollama/ml/backend/ggml.Context.Compute.func1(...)
Apr 11 11:14:27 workstation3 ollama[1649]:         github.com/ollama/ollama/ml/backend/ggml/ggml.go:515
Apr 11 11:14:27 workstation3 ollama[1649]: github.com/ollama/ollama/ml/backend/ggml.Context.Compute({0xc00182c040, 0x7f2700332f00, 0x7f2700332fa0, 0x0, 0x2000}, {0xc0015f4030, 0x1, 0x7f2700332f00?})
Apr 11 11:14:27 workstation3 ollama[1649]:         github.com/ollama/ollama/ml/backend/ggml/ggml.go:515 +0xbd fp=0xc0000bb860 sp=0xc0000bb7d0 pc=0x55e1497198bd
Apr 11 11:14:27 workstation3 ollama[1649]: github.com/ollama/ollama/ml/backend/ggml.(*Context).Compute(0xc00179fad0?, {0xc0015f4030?, 0x1?, 0x4a433020?})
Apr 11 11:14:27 workstation3 ollama[1649]:         <autogenerated>:1 +0x72 fp=0xc0000bb8d8 sp=0xc0000bb860 pc=0x55e14971fd32
Apr 11 11:14:27 workstation3 ollama[1649]: github.com/ollama/ollama/model/models/mistral3.(*TextModel).Forward.func1()
Apr 11 11:14:27 workstation3 ollama[1649]:         github.com/ollama/ollama/model/models/mistral3/model_text.go:117 +0xe2 fp=0xc0000bb930 sp=0xc0000bb8d8 pc=0x55e1497ba2c2
Apr 11 11:14:27 workstation3 ollama[1649]: sync.(*Once).doSlow(0xc0011eb368?, 0x55e14a5ff838?)
Apr 11 11:14:27 workstation3 ollama[1649]:         sync/once.go:78 +0xab fp=0xc0000bb988 sp=0xc0000bb930 pc=0x55e149328b0b
Apr 11 11:14:27 workstation3 ollama[1649]: sync.(*Once).Do(...)
Apr 11 11:14:27 workstation3 ollama[1649]:         sync/once.go:69
Apr 11 11:14:27 workstation3 ollama[1649]: github.com/ollama/ollama/model/models/mistral3.(*TextModel).Forward(0xc00045f030, {0x55e14a5ff838, 0xc00179f8c0}, {0x55e14a6090d0?, 0xc0011eb2c0?}, {0x55e14a6090d0, 0xc0011eb338}, {0x55e14a6090d0, 0xc0011eb350}, {{0x55e14a6090d0, ...}, ...}, ...)
Apr 11 11:14:27 workstation3 ollama[1649]:         github.com/ollama/ollama/model/models/mistral3/model_text.go:114 +0x1c8 fp=0xc0000bba90 sp=0xc0000bb988 pc=0x55e1497b9d48

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.6.5

GiteaMirror added the bug label 2026-04-29 02:24:24 -05:00

@richardhundt commented on GitHub (Apr 13, 2025):

2 x RTX A5000 GPUs.

I'm seeing this model eating up system RAM at a crazy rate until the kernel's oom killer knocks it over, despite the model comfortably fitting into VRAM. It's the only model that does this. Let me know if you need more info.

@Notbici commented on GitHub (Apr 15, 2025):

I'm not a developer, so I'd say that if someone with experience can check that stack trace and confirm nothing looks wrong there, it could be my hardware.

I ran a memory test and found a faulty stick of RAM. I replaced it, and so far we seem okay, so chances are it was bad RAM.

@deece commented on GitHub (Apr 23, 2025):

I see the same thing with a similar model (mistral-small3.1:24b) on a 7x RTX 3090 EPYC box with 512GB of ECC memory. No ECC errors were reported. I pass images in by path to the Python ollama client library.
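
For what it's worth, a minimal sketch of how the images are passed in (the model tag, prompt, and path are placeholders; this assumes the usual `ollama.chat` message format rather than my exact script):

```python
import ollama  # official Python client: pip install ollama

# Images are passed by file path in the message's "images" list;
# the client reads and encodes them before sending the request.
resp = ollama.chat(
    model="mistral-small3.1:24b",
    messages=[
        {
            "role": "user",
            "content": "What is in this photo?",
            "images": ["/path/to/photo.jpg"],  # path on disk, not base64
        }
    ],
)
print(resp["message"]["content"])
```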

@deece commented on GitHub (Apr 23, 2025):

This seems related: https://github.com/ollama/ollama/issues/7101

@deece commented on GitHub (Apr 23, 2025):

Here is the full debug log from a run where it fell over:
[ollama-debug.log](https://github.com/user-attachments/files/19867012/ollama-debug.log)

The relevant bit of the log appears to be:

Apr 23 22:05:43 volta ollama[27240]: ggml_cuda_compute_forward: SCALE failed
Apr 23 22:05:43 volta ollama[27240]: CUDA error: invalid configuration argument
Apr 23 22:05:43 volta ollama[27240]:   current device: 6, in function ggml_cuda_compute_forward at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2374
Apr 23 22:05:43 volta ollama[27240]:   err
Apr 23 22:05:43 volta ollama[27240]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error

I'm not sure whether the following warnings are a concern, but given that the error occurs when images are passed, they may be related:

Apr 23 22:05:40 volta ollama[27240]: time=2025-04-23T22:05:40.966+10:00 level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Apr 23 22:05:40 volta ollama[27240]: time=2025-04-23T22:05:40.970+10:00 level=WARN source=ggml.go:152 msg="key not found" key=mistral3.rope.freq_scale default=1
Apr 23 22:05:40 volta ollama[27240]: time=2025-04-23T22:05:40.970+10:00 level=WARN source=ggml.go:152 msg="key not found" key=mistral3.vision.attention.layer_norm_epsilon default=9.999999747378752e-06
Apr 23 22:05:40 volta ollama[27240]: time=2025-04-23T22:05:40.970+10:00 level=WARN source=ggml.go:152 msg="key not found" key=mistral3.vision.longest_edge default=1540
Apr 23 22:05:40 volta ollama[27240]: time=2025-04-23T22:05:40.970+10:00 level=WARN source=ggml.go:152 msg="key not found" key=mistral3.text_config.rms_norm_eps default=9.999999747378752e-06

Looking at the CUDA code, it looks like it's trying to perform a scale operation on a vector of floats and write the output to a destination array. My guess is that the destination pointer is either uninitialised or points to a buffer that is too small.

Any hints on how to debug the CUDA code?

@codearranger commented on GitHub (May 6, 2025):

I have an image attached to https://github.com/ollama/ollama/issues/10377 that causes a failure every time; you can use it for testing a fix.

@jhony1104 commented on GitHub (Jun 27, 2025):

This issue also occurs with mistral-small3.2.

@jessegross commented on GitHub (Sep 30, 2025):

Fixed by #12400

Reference: github-starred/ollama#53228