[GH-ISSUE #15334] mxfp8/nvfp4 generation hits strict 10m timeout ("context canceled") #71869

Closed
opened 2026-05-05 02:46:45 -05:00 by GiteaMirror · 5 comments

Originally created by @miguelpark on GitHub (Apr 5, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15334

Originally assigned to: @pdevine on GitHub.

What is the issue?

When generating a response with a large context size using specific model formats (mxfp8 and nvfp4), the request is abruptly terminated exactly at the 10-minute mark with an error="context canceled" message.

This issue seems entirely specific to the runner handling these newer formats (e.g., MLX). When running the exact same workload using the standard q8_0 format (llama.cpp runner), the generation completes successfully without any timeouts.

All tests were conducted with the model's Thinking mode enabled (think = true), which inherently requires a longer generation time. Memory usage is well within safe limits, confirming it's an artificial timeout issue, not an Out-Of-Memory (OOM) panic.

It appears there may be a hardcoded 10-minute HTTP client timeout between the Ollama main server and the subprocess handling mxfp8/nvfp4, or the runner fails to send the streaming/heartbeat signals needed to keep the connection alive, which the q8_0 runner evidently does send.
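
If the first theory is right, the failure signature is easy to reproduce in miniature. The Go sketch below is purely illustrative (it is not Ollama's actual code; the endpoint and port are copied from the logs further down): in Go, http.Client.Timeout bounds the entire exchange, including reading a streamed response body, so a 10-minute value cuts off a live token stream at exactly 10m0s.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Suspected pattern: Client.Timeout caps the WHOLE request, so a
	// stream that is still producing tokens is cancelled at 10m0s and
	// the runner sees "context canceled".
	suspect := &http.Client{Timeout: 10 * time.Minute}

	// Stream-safe alternative: bound only connection setup and response
	// headers, never the body read.
	streamSafe := &http.Client{
		Transport: &http.Transport{ResponseHeaderTimeout: 30 * time.Second},
	}
	_ = streamSafe // the loop below would run to completion with this client

	// Endpoint and port taken from the server logs, purely illustrative.
	resp, err := suspect.Post("http://127.0.0.1:64819/v1/completions",
		"application/json", strings.NewReader(`{"prompt":"...","stream":true}`))
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() { // with `suspect`, this read fails once 10 minutes elapse
		fmt.Println(sc.Text())
	}
	fmt.Println("stream ended:", sc.Err()) // e.g. "Client.Timeout exceeded while reading body"
}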

Environment

  • OS: macOS
  • Hardware: Apple Silicon Mac mini M4 (24GB Unified Memory)
  • Ollama Version: 0.20.0
  • Models Tested:
    • qwen3.5:9b-mxfp8 (Issue occurs - drops at 10m)
    • qwen3.5:9b-nvfp4 (Issue occurs - drops at 10m)
    • qwen3.5:9b-q8_0 (Works perfectly - completes generation without dropping)

Steps to Reproduce

  1. Load either the qwen3.5:9b-mxfp8 or qwen3.5:9b-nvfp4 model.
  2. Send an API request (/api/generate or /api/chat; a sketch of such a request follows this list) with:
    • A large context size (e.g., num_ctx: 262144)
    • A prompt that requires long generation/reasoning
    • Thinking mode enabled (think = true, or an equivalent system prompt that forces a long chain-of-thought)
  3. Wait for the generation process.
  4. Exactly at 10 minutes (600 seconds) from the generation start, the request drops.
  5. Repeat the exact same request using the qwen3.5:9b-q8_0 model -> It completes successfully.
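
For reference, here is a minimal Go client matching the request shape above. The model tag, think flag, and options.num_ctx value come from this report; the prompt, the default port 11434, and the surrounding code are placeholder assumptions for the sketch.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	body := `{
	  "model": "qwen3.5:9b-mxfp8",
	  "prompt": "(a task that forces long chain-of-thought reasoning)",
	  "stream": true,
	  "think": true,
	  "options": {"num_ctx": 262144}
	}`
	start := time.Now()
	resp, err := http.Post("http://127.0.0.1:11434/api/generate",
		"application/json", strings.NewReader(body))
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()

	// Drain the stream; each line is one JSON chunk. We only care when it dies.
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
	}
	// mxfp8/nvfp4: prints ~10m and a stream error; q8_0: runs to completion.
	fmt.Printf("stream ended after %s (err=%v)\n", time.Since(start), sc.Err())
}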

Expected Behavior

The runner handling mxfp8 and nvfp4 formats should continue processing and streaming the response until completion, regardless of how long generation takes, just as the q8_0 runner does.

Actual Behavior

The request gets killed exactly at 10 minutes (took=10m3s). Peak memory was only 11.90 GiB out of 24 GiB, so it is clearly a timeout issue, not a hardware limitation.

Server Logs

[GIN] 2026/04/05 - 13:44:08 | 200 |   22.270125ms |    192.168.0.56 | GET      "/api/tags"
time=2026-04-05T13:44:08.595+09:00 level=INFO source=sched.go:484 msg="system memory" total="24.0 GiB" free="19.9 GiB" free_swap="0 B"
time=2026-04-05T13:44:08.591+09:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="17.3 GiB" free="17.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-04-05T13:44:08.596+09:00 level=INFO source=client.go:432 msg="starting mlx runner subprocess" model=qwen3.5:9b-mxfp8 port=64819
time=2026-04-05T13:44:08.599+09:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-04-05T13:44:08.619+09:00 level=INFO source=server.go:32 msg="MLX engine initialized" "MLX version"=0.31.1 device=gpu
time=2026-04-05T13:44:08.763+09:00 level=INFO source=base.go:67 msg="Model architecture" arch=Qwen3_5ForConditionalGeneration
time=2026-04-05T13:44:08.924+09:00 level=INFO source=runner.go:135 msg="Loaded tensors from manifest" count=960
time=2026-04-05T13:44:12.208+09:00 level=INFO source=runner.go:169 msg="Starting HTTP server" host=127.0.0.1 port=64819
time=2026-04-05T13:44:12.300+09:00 level=INFO source=server.go:183 msg=ServeHTTP method=GET path=/v1/status took=39.375µs status="200 OK"
time=2026-04-05T13:44:12.300+09:00 level=INFO source=client.go:147 msg="mlx runner is ready" port=64819
time=2026-04-05T13:44:12.302+09:00 level=INFO source=cache.go:126 msg="cache miss" total=2464 matched=0 cached=0 left=2464
time=2026-04-05T13:44:21.769+09:00 level=INFO source=pipeline.go:134 msg="Prompt processing progress" processed=2048 total=2464
time=2026-04-05T13:44:23.715+09:00 level=INFO source=pipeline.go:134 msg="Prompt processing progress" processed=2460 total=2464
time=2026-04-05T13:44:23.848+09:00 level=INFO source=pipeline.go:134 msg="Prompt processing progress" processed=2463 total=2464
time=2026-04-05T13:54:12.306+09:00 level=INFO source=server.go:183 msg=ServeHTTP method=POST path=/v1/completions took=10m0.002051375s status="200 OK"
[GIN] 2026/04/05 - 13:54:12 | 500 |         10m3s |    192.168.0.56 | POST     "/api/generate"
time=2026-04-05T13:54:12.418+09:00 level=INFO source=pipeline.go:55 msg="peak memory" size="11.90 GiB"
time=2026-04-05T13:54:12.418+09:00 level=INFO source=runner.go:149 msg="Request terminated" error="context canceled"

Additional Context

As seen in the logs, prompt processing finishes very fast (~11 seconds for ~2.4k tokens). However, during the long generation phase triggered by Thinking mode, the request drops at exactly 10m0s. Since the standard q8_0 engine does not suffer from this issue with the same hardware, context parameters, and think=true state, this strongly points to a timeout discrepancy or buffering issue within the specific runner path used for mxfp8/nvfp4.
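
A quick check of the timestamps in the logs above (my arithmetic) is consistent with a fixed 10-minute budget on the server-to-runner request:

  prefill rate:   2464 tokens in ~11.5 s (13:44:12.302 -> 13:44:23.848), i.e. ~214 tokens/s
  inner request:  13:44:12.302 -> 13:54:12.306, i.e. ~10m0s (the logged took=10m0.002051375s)
  outer request:  took=10m3s, ~3 s longer, matching runner startup (13:44:08.6 -> 13:44:12.3)

That is, the inner /v1/completions call dies exactly ten minutes after it starts, which is what a 10-minute client-side timeout on that connection would produce.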

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.20.0

GiteaMirror added the bug label 2026-05-05 02:46:45 -05:00

@chigkim commented on GitHub (Apr 5, 2026):

I assume you specified the timeout in the API request?


@miguelpark commented on GitHub (Apr 5, 2026):

@chigkim Same setup, different results: this model hits a 10-minute limit, while other (non-MLX) models work fine even on much longer tasks.


@aboutlo commented on GitHub (Apr 5, 2026):

I have the same issue when I call Ollama from a pi.dev tool call. It seems Ollama buffers the entire tool-call arguments before sending them; basically, there is no incremental streaming.

When the total request time exceeds 10 minutes, it gets terminated with error="context canceled".


@t-fritsch commented on GitHub (Apr 23, 2026):

I still have the issue with the latest version. After 10m, the request stops (model: qwen3.6:35b-a3b-coding-nvfp4).

level=INFO source=runner.go:149 msg="Request terminated" error="context canceled"

Tried downgrading to Ollama 0.20.7 with no luck.


@pdevine commented on GitHub (Apr 23, 2026):

@t-fritsch can you post the log lines around that? I just want to be able to see the timestamps.
