[GH-ISSUE #3851] Why Ollama is so terribly slow when I set format="json" #2386

Open
opened 2026-04-12 12:42:16 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @marksalpeter on GitHub (Apr 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3851

What is the issue?

This is a duplicate of #3154, which was closed, I'm assuming, by mistake.
The `format="json"` param is 10x slower than regular inference when additional context is included.

A prompt like this takes ~24s to return on an NVIDIA T4 with CUDA enabled and `format="json"`. The exact same prompt without JSON format takes ~2s to return. This has got to be a bug, right?

```
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

${context}

Please respond in the following JSON schema
{
   "${schema.fieldName}": {
      "type": ${schema.type},
      "description": ${schema.description}
     }
}

Question: ${schema.description}
Helpful Answer:
```

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32

GiteaMirror added the bug, performance, api labels 2026-04-12 12:42:17 -05:00
Author
Owner

@sebdg commented on GitHub (Apr 23, 2024):

I think this is related to the loop-detection code in server.go. That code allows the model to cycle over whitespace for a number of tokens; if the last token is repeated a number of times, or only whitespace is detected for around 30 iterations, it aborts the prediction. It would be useful if you could run your request with Ollama in debug mode; more about this here: [troubleshooting](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues)

If this relates to the loop-detection logic, you will see a line like 'prediction aborted, token repeat limit reached' in the log.
On the other hand, some other bugs in stop detection relate to llama.cpp and could also be a cause.

It would also be useful to try your request with streaming enabled; this will show what the model returns:

```sh
curl http://127.0.0.1:11434/api/generate -d '{
   "model": "llama3:8b",
   "prompt": "You are a helpful writer, respond with an address in the US in JSON format.",
   "stream": true, "format": "json" }'
```

As a workaround, I would recommend not using `format="json"` for now and just mentioning JSON in the prompt itself. Depending on what your integration is, you might be better off capturing the JSON part of the response with a regex or so. I've had some flaws and inconsistent behavior using `format="json"` across different models; the regex might be a more robust solution to this.
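The regex workaround above can be sketched like this (a minimal example, not part of Ollama; it assumes the reply contains exactly one top-level JSON object, since the greedy match would span multiple objects and fail to parse):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the first {...} block out of a free-form model reply.

    A greedy match from the first '{' to the last '}' tolerates prose
    before and after the JSON; json.loads then validates what we grabbed.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))
```

Calling `extract_json('Sure! Here: {"city": "Austin"} Hope that helps.')` strips the surrounding prose and returns the parsed dict, so the model can answer conversationally without `format="json"`.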

Author
Owner

@coder543 commented on GitHub (Apr 24, 2024):

I will say... I've observed that some models are slower with json mode than others. I'm not sure if it is a bug in the implementation, or if the models themselves are just trained in interesting ways.

Observing the streaming response, it seems to respond quickly, but then it waits around for awhile before deciding the message is complete. A well-defined grammar would realize that the JSON message is over, and immediately terminate, rather than waiting for some kind of end-of-stream token, and this could be the issue here.
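That early-termination idea can be approximated client-side. Here is a sketch (not Ollama's implementation) that consumes a stream of text chunks and stops as soon as the accumulated buffer parses as a complete JSON value, using Python's `json.JSONDecoder.raw_decode`, instead of waiting for an end-of-stream token:

```python
import json

def stop_when_json_complete(chunks):
    """Accumulate streamed chunks and return as soon as a complete
    JSON value has been received, rather than waiting for EOS.

    raw_decode succeeds on a syntactically complete prefix, so any
    trailing whitespace the model keeps emitting is simply never read.
    """
    decoder = json.JSONDecoder()
    buf = ""
    for chunk in chunks:
        buf += chunk
        try:
            value, _ = decoder.raw_decode(buf.lstrip())
            return value  # complete JSON seen: stop consuming the stream
        except json.JSONDecodeError:
            continue  # not complete yet; keep reading
    return None  # stream ended without a complete JSON value
```

With a stream like `['{"a"', ': 1}', '\n\n\n']` the function returns after the second chunk and never touches the trailing whitespace, which is exactly the behavior a grammar-aware decoder would give for free.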

Author
Owner

@not-nullptr commented on GitHub (Apr 24, 2024):

getting this exact issue on `llama3:8b` but not with `mistral:latest`, weirdly enough? speeds for regular text between both models are the exact same on my 3080. i think @coder543 is correct, except this is a bug in the implementation. why are we outputting whitespace in the json in the first place?

Author
Owner

@not-nullptr commented on GitHub (Apr 25, 2024):

https://github.com/ollama/ollama/assets/62841684/a91cb579-4160-445d-ad47-caf888f17a39

https://github.com/ollama/ollama/assets/62841684/fbc5a9b4-0113-4d2a-8467-5b24083433f7

the first video demonstrates my function calling without `"format": "json"`, and the second demonstrates it with `"format": "json"`. you can see the speed difference is insane; same prompt and everything.

Author
Owner

@coder543 commented on GitHub (Apr 25, 2024):

Unfortunately your video isn’t visually showing it generating JSON in both modes. If the model can’t respond with the correct JSON without JSON mode at least some of the time, it makes it harder to know for sure where the issue is in a piece of software like this.

It would also be helpful if the response were streaming (with a clear visual indication of when the streaming response has finished) so we could see if it pauses after generating the JSON, or if it is just generating the JSON really slowly character by character.
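One way to produce that timing evidence: a small script against Ollama's documented `/api/generate` streaming API (the model name and prompt below are placeholders) that timestamps every streamed chunk, plus a helper that finds the longest pause, so a stall after the JSON body is distinguishable from uniformly slow generation:

```python
import json
import time
import urllib.request

def stream_with_timestamps(url="http://127.0.0.1:11434/api/generate",
                           model="llama3:8b",
                           prompt="Respond with a US address in JSON.",
                           fmt="json"):
    """Stream /api/generate (NDJSON, one object per line) and record
    each chunk's arrival time relative to the start of the request."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": True, "format": fmt}).encode()
    req = urllib.request.Request(url, data=body)
    events, start = [], time.monotonic()
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = json.loads(line)
            events.append((time.monotonic() - start,
                           chunk.get("response", "")))
            if chunk.get("done"):
                break
    return events

def largest_gap(events):
    """Return (gap_seconds, index) of the longest pause between chunks."""
    gaps = [(b[0] - a[0], i + 1)
            for i, (a, b) in enumerate(zip(events, events[1:]))]
    return max(gaps) if gaps else (0.0, 0)
```

If `largest_gap` points at the final chunk, the model finished the JSON quickly and then sat waiting for a stop condition; if the gaps are spread evenly, generation itself is slow.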

I would definitely like this slow JSON situation to be fixed.

Author
Owner

@mitar commented on GitHub (May 29, 2024):

There are some known upstream issues with grammar restrictions (which JSON format uses): https://github.com/ggerganov/llama.cpp/issues/4218

Author
Owner

@ZeyBal commented on GitHub (Nov 25, 2025):

Hello! Is it fixed?

Author
Owner

@willson556 commented on GitHub (Feb 10, 2026):

It seems like llama.cpp has moved forward in supporting faster grammar restrictions (https://github.com/ggml-org/llama.cpp/pull/10224, for example, has merged), but it doesn't seem like that's accessible through Ollama yet.

Reference: github-starred/ollama#2386