[GH-ISSUE #12696] Model runner crashes with SIGABRT when prompt exceeds ~33,272 tokens on deepseek-v3.1:671b with num_ctx >= 40960 #70483

Open
opened 2026-05-04 21:41:04 -05:00 by GiteaMirror · 2 comments

Originally created by @andrewwutw on GitHub (Oct 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12696

Originally assigned to: @gr4ceG on GitHub.

What is the issue?

Summary

The ollama server crashes with SIGABRT when running the deepseek-v3.1:671b model with num_ctx set to 40,960 or higher and a prompt exceeding approximately 33,272 tokens. The issue is fully reproducible and appears to be specific to the deepseek model.

Environment

  • Hardware: Mac Studio M3 Ultra, 512 GB RAM
  • OS: macOS 15.7
  • Ollama Version: 0.12.6
  • Installation Method: MacPorts
  • Model: deepseek-v3.1:671b (hash: 044d50a3d79c, size: 404 GB)

Server Configuration

export OLLAMA_HOST="0.0.0.0:11434"
export OLLAMA_KEEP_ALIVE="15m"
export OLLAMA_NUM_PARALLEL=2
ollama serve

Steps to Reproduce

  1. Create a custom model with increased context size:
cat > Modelfile << EOF
FROM deepseek-v3.1:671b
PARAMETER num_ctx 40960
EOF

ollama create deepseek-v3.1:671b-token-40k -f Modelfile
  2. Run with a prompt containing 33,273 repetitions of "hello ":
ollama stop deepseek-v3.1:671b-token-40k
ollama run deepseek-v3.1:671b-token-40k $(python3 -c "print('hello ' * 33273)")

Expected Behavior

The model should process the prompt successfully or gracefully handle prompts that exceed internal limits.

Actual Behavior

The model runner crashes with the following error:

Error: 500 Internal Server Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details

Server logs show:

Assertion failed: (found), function llama_sampler_dist_apply, file llama-sampling.cpp, line 662.
SIGABRT: abort

The ollama server itself continues running and can accept new requests after the crash.

Additional Observations

Works:

  • Prompt with 33,272 repetitions of "hello " (prompt_eval_count = 33,276)
  • Any prompt length below this threshold
  • When num_ctx is set to 32,768 (prompt is truncated to 32,768 tokens, prompt_eval_count = 32,768)

Fails:

  • Prompt with 33,273 or more repetitions of "hello "
  • Same behavior observed with num_ctx set to 80,920

Other Models:

  • No similar issue observed with other models such as qwen3:235b-a22b-instruct-2507-q8_0
  • Successfully tested with num_ctx = 262,144 and 120,000 repetitions of "hello " on qwen3 model

Resource Usage

Memory usage monitored with htop shows approximately 416 GB out of 512 GB total RAM in use, indicating sufficient available memory.

Reproducibility

This issue is 100% reproducible with the steps outlined above.
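The failing repetition count can also be located mechanically. The sketch below (not part of the original report) bisects the prompt length through Ollama's HTTP API; it assumes a local server on 127.0.0.1:11434 and the custom model name from the steps above. The bisection itself is a pure function, so it works with any failure predicate.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # assumed local server
MODEL = "deepseek-v3.1:671b-token-40k"              # model from the repro steps

def prompt_fails(reps: int) -> bool:
    """True if a prompt of `reps` repetitions of 'hello ' crashes the runner."""
    body = json.dumps({"model": MODEL, "prompt": "hello " * reps,
                       "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=3600) as resp:
            json.load(resp)  # a 200 with a JSON body means the runner survived
        return False
    except Exception:
        return True  # the 500 from a dead runner surfaces as an HTTPError

def bisect_threshold(fails, ok: int, bad: int) -> int:
    """Smallest count in (ok, bad] for which fails() is True.
    Assumes fails(ok) is False, fails(bad) is True, and failure is monotonic."""
    while bad - ok > 1:
        mid = (ok + bad) // 2
        if fails(mid):
            bad = mid
        else:
            ok = mid
    return bad

if __name__ == "__main__":
    # Bounds taken from the report: 1 repetition works, 40,000 crash.
    print(bisect_threshold(prompt_fails, 1, 40000))
```

Each probe loads and runs the full 404 GB model, so a complete bisection takes a while; the reported threshold (33,272 vs 33,273 repetitions) was found this way by hand.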

Server Log

The complete server error log has been attached separately due to length. The critical error occurs at:

Assertion failed: (found), function llama_sampler_dist_apply, file llama-sampling.cpp, line 662.
SIGABRT: abort

ollama-error-log.txt (https://github.com/user-attachments/files/22991943/ollama-error-log.txt)

This suggests an internal assertion failure in the llama.cpp sampling logic when processing prompts above a certain token threshold specific to the deepseek-v3.1:671b model configuration.
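For context on the failing assertion: llama_sampler_dist_apply samples a token index by walking the cumulative probability distribution and asserts that a candidate was found. The sketch below is an illustration of that general pattern, not the actual llama.cpp code; it shows how the found flag can stay false when the probabilities are corrupted (e.g. NaN logits), since every comparison against NaN is false.

```python
def dist_sample(probs, rng):
    """Pick an index by walking the cumulative distribution, mirroring the
    found-flag pattern of a dist sampler (illustrative only)."""
    r = rng()       # uniform draw in [0, 1)
    cum = 0.0
    found = False
    idx = -1
    for i, p in enumerate(probs):
        cum += p
        if r < cum:  # always False once cum is NaN
            idx, found = i, True
            break
    assert found, "no candidate found -- probabilities do not cover [0, 1)"
    return idx

# Well-formed distribution: a valid index is always found.
print(dist_sample([0.2, 0.3, 0.5], rng=lambda: 0.9))  # → 2

# A single NaN poisons the cumulative sum, `found` never becomes True,
# and the assertion fires -- analogous to the Assertion failed: (found) abort.
try:
    dist_sample([0.2, float("nan"), 0.5], rng=lambda: 0.9)
except AssertionError as e:
    print("assertion failed:", e)
```

Why the logits would become NaN (or otherwise fail to normalize) only above this exact token count on this model is the open question of the report.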

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.12.6

GiteaMirror added the bug label 2026-05-04 21:41:04 -05:00

@andrewwutw commented on GitHub (Nov 3, 2025):

Tested on ollama 0.12.9, but the problem persists.


@andrewwutw commented on GitHub (Jan 12, 2026):

I am no longer using the MacPorts version of Ollama. I have switched to v0.14.0-rc2 (specifically v0.14.0-rc2/ollama-darwin.tgz) downloaded directly from the GitHub releases page.

I found a specific threshold where the model runner crashes.

Test Case 1: Success

When running the following command with 33271 repetitions of "hello ":

python3 -c "print('hello ' * 33271)" | ollama run deepseek-v3.1:671b-token-40k

Ollama works as expected and outputs:

Thinking...
hello hello hello hello hello hello hello hello hello hello hello hello
hello hello hello hello hello hello hello hello hello hello hello hello
hello hello hello hello hello hello hello

Test Case 2: Failure

However, if I increase the count by just one to 33272:

python3 -c "print('hello ' * 33272)" | ollama run deepseek-v3.1:671b-token-40k

Ollama fails with the following error:

Error: 500 Internal Server Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details

Server Logs

The logs show a panic: panic: failed to sample token

time=2026-01-12T16:35:50.146+08:00 level=INFO source=routes.go:1601 msg="server config" env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:15m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/andrew/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:2 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false http_proxy: https_proxy: no_proxy:]"
time=2026-01-12T16:35:50.154+08:00 level=INFO source=images.go:499 msg="total blobs: 171"
time=2026-01-12T16:35:50.156+08:00 level=INFO source=images.go:506 msg="total unused blobs removed: 0"
time=2026-01-12T16:35:50.157+08:00 level=INFO source=routes.go:1654 msg="Listening on [::]:11434 (version 0.14.0-rc2)"
time=2026-01-12T16:35:50.158+08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-01-12T16:35:50.159+08:00 level=INFO source=server.go:429 msg="starting runner" cmd="/Users/andrew/bin/ollama-bin/v0.14.0-rc2/ollama runner --ollama-engine --port 63645"
time=2026-01-12T16:35:50.222+08:00 level=INFO source=types.go:42 msg="inference compute" id=0 filter_id=0 library=Metal compute=0.0 name=Metal description="Apple M3 Ultra" libdirs="" driver=0.0 pci_id="" type=discrete total="464.0 GiB" available="464.0 GiB"
[GIN] 2026/01/12 - 16:35:56 | 200 |      27.084µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/01/12 - 16:35:56 | 200 |   25.914041ms |       127.0.0.1 | POST     "/api/show"
time=2026-01-12T16:35:56.289+08:00 level=INFO source=server.go:429 msg="starting runner" cmd="/Users/andrew/bin/ollama-bin/v0.14.0-rc2/ollama runner --ollama-engine --model /Users/andrew/.ollama/models/blobs/sha256-8eeb1709986060613eb794d3fbbbf4ce7f2120cd174c95b64ee9f0c906c48910 --port 63723"
time=2026-01-12T16:35:56.291+08:00 level=INFO source=sched.go:452 msg="system memory" total="512.0 GiB" free="569.3 GiB" free_swap="0 B"
time=2026-01-12T16:35:56.291+08:00 level=INFO source=sched.go:459 msg="gpu memory" id=0 library=Metal available="463.5 GiB" free="464.0 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-01-12T16:35:56.291+08:00 level=INFO source=server.go:755 msg="loading model" "model layers"=62 requested=-1
time=2026-01-12T16:35:56.313+08:00 level=INFO source=runner.go:1405 msg="starting ollama engine"
time=2026-01-12T16:35:56.314+08:00 level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:63723"
time=2026-01-12T16:35:56.325+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Disabled KvSize:81920 KvCacheType: NumThreads:24 GPULayers:62[ID:0 Layers:62(0..61)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-12T16:35:56.338+08:00 level=INFO source=ggml.go:136 msg="" architecture=deepseek2 file_type=Q4_K_M name="" description="" num_tensors=1086 num_key_values=45
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M3 Ultra
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 498216.21 MB
time=2026-01-12T16:35:56.339+08:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M3 Ultra
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-01-12T16:35:56.458+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Disabled KvSize:81920 KvCacheType: NumThreads:24 GPULayers:62[ID:0 Layers:62(0..61)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Disabled KvSize:81920 KvCacheType: NumThreads:24 GPULayers:62[ID:0 Layers:62(0..61)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=ggml.go:482 msg="offloading 61 repeating layers to GPU"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=ggml.go:494 msg="offloaded 62/62 layers to GPU"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="376.2 GiB"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="497.1 MiB"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="10.1 GiB"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="20.5 GiB"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="14.0 MiB"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=device.go:272 msg="total memory" size="407.4 GiB"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=sched.go:526 msg="loaded runners" count=1
time=2026-01-12T16:35:57.068+08:00 level=INFO source=server.go:1347 msg="waiting for llama runner to start responding"
time=2026-01-12T16:35:57.068+08:00 level=INFO source=server.go:1381 msg="waiting for server to become available" status="llm server loading model"
time=2026-01-12T16:37:02.002+08:00 level=INFO source=server.go:1385 msg="llama runner started in 65.71 seconds"
panic: failed to sample token

goroutine 1737 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0x140002610e0, {0x40, {0x1045b8850, 0x14000518000}, {0x1045c3820, 0x1400ba63d58}, {0x14000196008, 0x1fd, 0x25f}, {{0x1045c3820, ...}, ...}, ...})
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:763 +0x1554
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 69
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:458 +0x22c
time=2026-01-12T16:49:52.161+08:00 level=ERROR source=server.go:1592 msg="post predict" error="Post \"http://127.0.0.1:63723/completion\": EOF"
[GIN] 2026/01/12 - 16:49:52 | 500 |        13m55s |       127.0.0.1 | POST     "/api/generate"

Additionally, I encountered similar errors when testing with v0.13.x downloaded from GitHub releases.


Reference: github-starred/ollama#70483