[GH-ISSUE #14858] Feature Request: Stream tool call arguments in anthropic/anthropic.go:Process() to avoid triggering a Claude Code timeout #35344

Closed
opened 2026-04-22 19:46:55 -05:00 by GiteaMirror · 3 comments

Originally created by @egtung on GitHub (Mar 15, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14858

Environment

  * Ollama 0.17.7 + a patch for https://github.com/ollama/ollama/issues/14816
  * Claude Code 2.1.70
  * OS/hardware: Ubuntu 24.04, AMD Ryzen 5, GeForce RTX 3060

If it matters, I'm running everything locally, ethernet cable unplugged.

```
$ ollama ps
NAME                    ID              SIZE     PROCESSOR          CONTEXT    UNTIL
glm-4.7-flash:latest    d1a8a26252f1    21 GB    46%/54% CPU/GPU    65536      59 minutes from now
```

Reproduction

Disclaimer

I don't have a good, simple repro case - I'm not great at getting LLMs to do exactly what I want, so provoking the problematic behavior is hit or miss. What I'm trying to do is to get Claude to write out a long-ish file (specifically, something that takes more than ~255 seconds to generate the tool call for). Claude sometimes tries to do other things (e.g. search the web, or summarize the story) before writing out the file; I'm not sure how those interact but was trying to avoid them to keep things simple. Reproducing this might require some retries or prompt fiddling, but **any way of generating a lot of text as a tool call argument should work**, I'm just not sure how to capture that concisely.

Steps

  1. Run a local Ollama build: `<various Ollama env vars> go run . serve`
  2. Run Claude Code: `ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model glm-4.7-flash`
  3. At the prompt, enter something like `Write the public domain short story "Luck" by Mark Twain to luck.txt.`

While this is sufficient to reproduce the behavior on my system, the underlying issue is a timeout; a more powerful computer might be able to successfully complete the task quickly enough. Setting `CUDA_VISIBLE_DEVICES=""` to force CPU inference or specifying a longer text should make the issue more obvious.

Expected behavior

luck.txt should be created and contain text.

Observed behavior

When the request finishes, `grep ERROR ~/.claude/debug/latest` shows `[ERROR] Error streaming, falling back to non-streaming mode: The operation timed out.` No file is created.

Observations

https://ollama.com/blog/streaming-tool initially made me think this was expected to work, but reading it more closely I see that it only talks about streaming tool *responses*, not composing the *arguments*, so I'm filing this as a feature request instead of as a bug.

Packet trace behavior

When I've been able to get this to reproduce (which is not 100%), inspecting a packet trace suggests that Claude's streaming timeout seems to be about 255 seconds since the last content block was received. Looking at port 11434 (i.e. Claude accessing Ollama), the thinking block is streamed as expected, but then the stream pauses when it reaches the tool call - there are still keep-alive packets being exchanged, but no content block deltas. Looking at the other interesting TCP stream (which I believe is Ollama internally talking to the LLM), I see that it's in the middle of streaming a tool call with the entire text of Luck; this stream makes continuous progress but the 11434 stream is stalled.
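
For reference, the Anthropic streaming format delivers a tool call as a sequence of SSE events like the following (abridged and illustrative; real payloads carry more fields, e.g. a tool_use id, and the fragment boundaries here are made up). The ~255-second timer appears to measure the gap between these frames:

```
event: content_block_start
data: {"type":"content_block_start","index":1,"content_block":{"type":"tool_use","name":"Write","input":{}}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"{\"file_path\":\"luck.txt\","}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"\"content\":\"It was at a banquet in London...\""}}

... further input_json_delta frames while the arguments are generated ...

event: content_block_stop
data: {"type":"content_block_stop","index":1}
```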

Source code

From what I can see in the source, Process() is not set up to stream tool call arguments. https://github.com/ollama/ollama/blob/f8b657c9670a4319930e8d7e5444460df91a7b5d/anthropic/anthropic.go#L886-L898 sets up a tool call, https://github.com/ollama/ollama/blob/f8b657c9670a4319930e8d7e5444460df91a7b5d/anthropic/anthropic.go#L900-L910 provides all the arguments, and https://github.com/ollama/ollama/blob/f8b657c9670a4319930e8d7e5444460df91a7b5d/anthropic/anthropic.go#L912-L918 finishes the tool call. There is no loop around the deltas, nor is there any state tracking; as far as I can tell, the code expects to be presented with the tool arguments all at once, i.e. it must be blocking until they're all ready. This is a problem if the arguments are large and take time to generate.
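
For illustration only, here is a minimal sketch of the loop-with-state shape described above as missing. Every name in it (sseEvent, streamToolArgs, the fragments channel) is hypothetical and not taken from anthropic.go; only the event names and the partial_json field follow the Anthropic streaming format:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// sseEvent writes one server-sent-event frame. In a real handler the
// writer would be flushed after every frame; that flush is what resets
// the client's inter-event timer.
func sseEvent(w io.Writer, name string, payload any) error {
	data, err := json.Marshal(payload)
	if err != nil {
		return err
	}
	_, err = fmt.Fprintf(w, "event: %s\ndata: %s\n\n", name, data)
	return err
}

// streamToolArgs forwards argument fragments as they arrive instead of
// waiting for the complete object: one content_block_start, one
// input_json_delta per fragment, one content_block_stop at the end.
func streamToolArgs(w io.Writer, index int, toolName string, fragments <-chan string) error {
	if err := sseEvent(w, "content_block_start", map[string]any{
		"type":  "content_block_start",
		"index": index,
		"content_block": map[string]any{
			"type": "tool_use", "name": toolName, "input": map[string]any{},
		},
	}); err != nil {
		return err
	}
	for frag := range fragments {
		if err := sseEvent(w, "content_block_delta", map[string]any{
			"type":  "content_block_delta",
			"index": index,
			"delta": map[string]any{"type": "input_json_delta", "partial_json": frag},
		}); err != nil {
			return err
		}
	}
	return sseEvent(w, "content_block_stop", map[string]any{
		"type": "content_block_stop", "index": index,
	})
}

func main() {
	// Simulate a model emitting a large argument object in pieces.
	fragments := make(chan string, 3)
	fragments <- `{"file_path":"luck.txt",`
	fragments <- `"content":"It was at a banquet in London..."`
	fragments <- `}`
	close(fragments)
	_ = streamToolArgs(os.Stdout, 1, "Write", fragments)
}
```

The point is only the shape: each fragment produces a frame on the wire immediately, so the client sees progress while the arguments are still being generated.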

Happy case

Prompt Claude with something like `Write "Hello world" to hello.md`

This works as expected - the tool arguments are small, so they complete well before Claude times out.

Impact

Any tool call that takes a substantial amount of time to compose (not necessarily execute) may time out and fail, so any complex task involving a tool should be able to trigger this.

While the issue could be mitigated if Claude Code allowed a longer timeout (I don't believe it does), it also doesn't seem great that a streaming response stops passing data through even though that data is theoretically available.

Feature Request

Ollama could reformat the tool call arguments on the fly, similar to what it does with thinking or text blocks, allowing them to be streamed as they are generated instead of blocking until they're complete. This would avoid provoking Claude's timeout and allow longer tool calls to succeed. This seems like it would change how Message.ToolCalls are generated and used, so it's probably not a trivial change.
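
As a rough sketch of the shape change this implies - with ToolCallDelta and ArgumentsFragment as invented names for illustration, not Ollama's actual types - the internal representation would need to carry raw partial JSON rather than a fully parsed argument object:

```go
package sketch

// Hypothetical shapes only; Ollama's real Message/ToolCall types differ.

// Current (blocking) shape: arguments are a complete, parsed object,
// which by construction only exists once the model has finished
// generating them.
type ToolCall struct {
	Name      string
	Arguments map[string]any
}

// Streaming shape: each delta carries whatever raw JSON the model has
// produced since the last one, so Process() could forward it
// immediately as an input_json_delta.
type ToolCallDelta struct {
	Index             int    // which tool_use content block this extends
	Name              string // set on the first delta only
	ArgumentsFragment string // raw partial JSON, forwarded as-is
}
```

Forwarding raw fragments (rather than re-parsing) would match the Anthropic wire format, whose input_json_delta events carry partial_json strings that are generally not valid JSON on their own.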

Until this feature is added, `docs/api/anthropic-compatibility.mdx` could be updated to call out this limitation.

GiteaMirror added the feature request label 2026-04-22 19:46:55 -05:00

@miguelmanlyx commented on GitHub (Mar 15, 2026):

Just chiming in, getting some timeout vibes from anthropic here. npx ai-doctor sets up automatic failover without much fuss.


@hailmary-ship-it commented on GitHub (Mar 15, 2026):

Just chiming in, looks like a timeout thing happening with anthropic. Might be worth trying npx ai-doctor for a fallback.


@egtung commented on GitHub (Mar 17, 2026):

The feature request in #14902 addresses the same problem while being simpler and more robust, so I'm withdrawing this one.

Reference: github-starred/ollama#35344