[GH-ISSUE #14426] . #71426

Closed
opened 2026-05-05 01:37:48 -05:00 by GiteaMirror · 4 comments

Originally created by @ghost on GitHub (Feb 25, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14426

.

GiteaMirror added the bug label 2026-05-05 01:37:48 -05:00

@rick-github commented on GitHub (Feb 25, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.
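For example, on a Linux install that runs Ollama as the default systemd service (the service name and `OLLAMA_DEBUG` variable below assume that standard setup), the logs can be captured like this:

```bash
# Follow the Ollama server log from the systemd journal
journalctl -u ollama -f --no-pager

# Or stop the service and run the server manually with debug logging,
# saving the output to a file that can be attached to the issue
sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama-debug.log
```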


@xXMrNidaXx commented on GitHub (Feb 25, 2026):

### Diagnostic Steps for AMD GPU + 0.17.0

A few things to check:

**1. You mentioned "CudaMemory error" but have an AMD GPU**

Ollama uses ROCm for AMD, not CUDA. If you're seeing CUDA errors, something's misconfigured:

```bash
# Check what Ollama detects
ollama run --verbose llama3.2:1b 2>&1 | head -20

# Verify ROCm is working
rocminfo | grep "Agent"
```
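The server's startup log also records which GPU backend was discovered. The exact wording differs between Ollama versions, so the grep pattern below is only a rough filter, and the service name assumes the default Linux systemd install:

```bash
# Look for GPU discovery lines in the server log
journalctl -u ollama --no-pager | grep -iE "rocm|amdgpu|cuda|inference compute" | tail -20
```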

**2. The model recreation might have introduced issues**

When you did `ollama create ... -f glm4.7`, did the modelfile have correct parameters for your VRAM? Check:

```bash
cat glm4.7 | grep -E "PARAMETER|num_gpu"
```

For 16GB VRAM with a Q3_K_XL quant, you should have headroom, but try:

```bash
# Force CPU-only to isolate GPU issues
OLLAMA_NUM_GPU=0 ollama run glm-4.7-flash:UD-Q3_K_XL
```

If CPU works, the issue is GPU detection/allocation.
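To see how a loaded model is actually being split between CPU and GPU, `ollama ps` reports the processor share while the model is resident:

```bash
# Shows loaded models and whether they run on CPU, GPU, or a mix
ollama ps
```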

**3. 0.17.0 regression vs 0.16.3**

If 0.16.3 worked with occasional 500s and 0.17.0 fails 100%, try:

```bash
# Downgrade
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.16.3 sh

# Or check logs for what changed
journalctl -u ollama --since "1 hour ago" | grep -i error
```

**4. Specific to AMD ROCm:**

Which ROCm version and GPU model?

```bash
rocm-smi --showproductname
cat /opt/rocm/.info/version
```

0.17.0 may have changed ROCm compatibility — there have been reports with older ROCm versions.
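If the card turns out not to be on the supported list for the bundled ROCm, one commonly reported workaround is overriding the detected GFX target. Whether it applies here depends entirely on the GPU model, so treat the value below as an example, not a recommendation:

```bash
# Example only: override the GFX target (10.3.0 suits many RDNA2 cards; pick the value for your GPU)
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve
```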


If you can share the actual error from `ollama run --verbose glm-4.7-flash:UD-Q3_K_XL 2>&1`, that would help narrow it down. The "too few bytes" error suggests model layer offloading is failing mid-inference.


@rick-github commented on GitHub (Feb 25, 2026):

@xXMrNidaXx

> OLLAMA_NUM_GPU=0 ollama run glm-4.7-flash:UD-Q3_K_XL

Knock it off with the AI slop.


@rick-github commented on GitHub (Feb 26, 2026):

```
Feb 26 13:44:33 lain ollama[31676]: time=2026-02-26T13:44:33.908+01:00 level=INFO source=routes.go:1768 msg="vram-based default context" total_vram="23.7 GiB" default_num_ctx=32768
Feb 26 13:45:10 lain ollama[31676]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 3196.00 MiB on device 0: cudaMalloc failed: out of memory
```

A change in the default context (#14116) results in ollama trying to allocate too much VRAM. Set `OLLAMA_CONTEXT_LENGTH=4096` in the server environment to restore the previous behaviour.
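If Ollama runs as the default Linux systemd service, the variable can be set through a service override (standard systemd procedure; adjust the service name if your install differs):

```bash
# Open an override file for the ollama service
sudo systemctl edit ollama.service
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_CONTEXT_LENGTH=4096"

# Reload unit files and restart the server so the variable takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama
```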
