[GH-ISSUE #3154] Why Ollama is so terribly slow when I set format="json" #63977

Closed
opened 2026-05-03 15:37:56 -05:00 by GiteaMirror · 14 comments

Originally created by @eliranwong on GitHub (Mar 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3154

When I use format="json" the speed is extremely slow. However, I just tried llamafile with JSON output and the same prompt: what takes Ollama two minutes to answer takes llamafile a few seconds with the same model. Please advise; if this issue cannot be sorted out, Ollama is obviously not a suitable choice for developing applications that need JSON output. I really like Ollama because it is easy to set up.

```
import ollama
from ollama import Options

# messages is defined elsewhere in the application
completion = ollama.chat(
    model="mistral",
    messages=messages,
    format="json",
    options=Options(
        temperature=0.0,
        num_ctx=100000,
        num_predict=-1,
    ),
)
```
GiteaMirror added the question label 2026-05-03 15:37:56 -05:00

@igorschlum commented on GitHub (Mar 14, 2024):

Hi @eliranwong, JSON should not be slow. Did you try other models? What OS do you use?
Generally, when it's slow, it means that Ollama doesn't have enough memory to run the model.


@igorschlum commented on GitHub (Mar 15, 2024):

Version 0.1.29 fixes an issue with JSON: "Fixed issues where Ollama would hang when using JSON mode."

Can you upgrade and tell us if it works now?
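
A quick way to check which version the server is actually running (a minimal sketch, assuming the default local server at localhost:11434):

```
# Ask the running Ollama server for its version via the /api/version endpoint.
import requests

resp = requests.get("http://localhost:11434/api/version", timeout=5)
resp.raise_for_status()
print(resp.json()["version"])  # e.g. "0.1.29"
```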


@eliranwong commented on GitHub (Mar 15, 2024):

> Hi @eliranwong, JSON should not be slow. Did you try other models? What OS do you use? Generally, when it's slow, it means that Ollama doesn't have enough memory to run the model.

I use Linux, and I tested on the same machine. Ollama is clearly slower on Linux.


@eliranwong commented on GitHub (Mar 15, 2024):

> Version 0.1.29 fixes an issue with JSON: "Fixed issues where Ollama would hang when using JSON mode."
>
> Can you upgrade and tell us if it works now?

I upgraded to the latest 0.1.29. It is still slow, but the hanging issue seems to have improved, at least it happens less often.

I tested on macOS today: even with much less memory, it is faster on macOS when JSON mode is on. When JSON mode is off, the macOS machine runs slower than the same Linux device. It appears to me that the issue is with Linux.

In sum:

- when JSON is off, my Linux is faster than my macOS
- when JSON is on, my Linux is slower than my macOS
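
A minimal sketch for reproducing this on/off comparison, assuming the `ollama` Python client and the `mistral` model are installed locally (the timing fields `eval_count` and `eval_duration` come from the non-streaming response):

```
# Time the same prompt with and without format="json" and report generation speed.
import time
import ollama

messages = [{"role": "user", "content": "List three colours as a JSON object."}]

for fmt in ("", "json"):
    start = time.time()
    resp = ollama.chat(model="mistral", messages=messages, format=fmt)
    wall = time.time() - start
    # eval_duration is reported in nanoseconds by the server.
    tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"format={fmt or 'off':4}  wall={wall:.1f}s  speed={tps:.1f} tok/s")
```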


@igorschlum commented on GitHub (Mar 15, 2024):

Do you have a sample or a script I can run on a Mac with a large amount of memory, to see if I can reproduce the slowness? I can only test on macOS.


@eliranwong commented on GitHub (Mar 15, 2024):

After lots of trial and error, I finally found what makes it very slow and what makes it faster.

Previously, I included BOTH a JSON schema AND some other important information in the system message. The results were very slow.

Now, I have removed all non-schema information from the system message, i.e. leaving ONLY the schema, as in the working example below. The response time is now acceptable.

However, this is not ideal, as I need to include other information in the system message for my application. If possible, I would suggest that Ollama support passing the schema as part of a "response_format" parameter, separate from the system message, just like the llama.cpp implementation; see the "JSON Schema Mode" section at https://llama-cpp-python.readthedocs.io/en/latest/
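
For reference, the llama-cpp-python pattern being suggested looks roughly like this; a sketch based on the "JSON Schema Mode" section of its docs, with a placeholder model path and a trimmed-down schema:

```
# llama-cpp-python accepts the schema in response_format, leaving the system
# message free for other application instructions.
from llama_cpp import Llama

llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)  # placeholder path
out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You extract calendar events."},
        {"role": "user", "content": "Dinner on 3 March 2024, 7pm to 9pm, Central London."},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}, "start_time": {"type": "string"}},
            "required": ["title", "start_time"],
        },
    },
    temperature=0.0,
)
print(out["choices"][0]["message"]["content"])
```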

By the way, the example below is what I finally found to work with Ollama in reasonable time. However, as my app needs to add information to the system message beyond the JSON schema, I am still looking for a better solution. I may need to switch to llama.cpp if there is no better one in Ollama.

```
import ollama, json, traceback
from ollama import Options

def getResponseDict(messages, **kwargs):
    try:
        completion = ollama.chat(
            #keep_alive=0,
            model="mistral",
            messages=messages,
            format="json",
            stream=False,
            options=Options(
                temperature=0.0,
                num_ctx=100000,
                num_predict=-1,
            ),
            **kwargs,
        )
        jsonOutput = completion["message"]["content"]
        responseDict = json.loads(jsonOutput)
        return responseDict
    except Exception:
        print(traceback.format_exc())
        return {}

schema = {
    "name": "add_calendar_event",
    "description": "add calendar event",
    "parameters": {
        "type": "object",
        "properties": {
            "calendar": {
                "type": "string",
                "description": "The calendar application. Return 'google' if not given.",
                "enum": ['google', 'outlook'],
            },
            "title": {
                "type": "string",
                "description": "The title of the event.",
            },
            "description": {
                "type": "string",
                "description": "The detailed description of the event, including the people involved and their roles, if any.",
            },
            "url": {
                "type": "string",
                "description": "Event url",
            },
            "start_time": {
                "type": "string",
                "description": "The start date and time of the event in the format `YYYYMMDDTHHmmss`. For example, `20220101T100000` represents January 1, 2022, at 10:00 AM.",
            },
            "end_time": {
                "type": "string",
                "description": "The end date and time of the event in the format `YYYYMMDDTHHmmss`. For example, `20220101T100000` represents January 1, 2022, at 10:00 AM. If not given, return 1 hour later than the start_time",
            },
            "location": {
                "type": "string",
                "description": "The location or venue of the event.",
            },
        },
        "required": ["calendar", "title", "description", "start_time", "end_time"],
    },
}

template = {
    "calendar": "",
    "title": "",
    "description": "",
    "url": "",
    "start_time": "",
    "end_time": "",
    "location": "",
}

messages = [
    {
        "role": "system",
        "content": f"""You are a JSON builder expert. You response to my input according to the following schema:

{schema}""",
    },
    {
        "role": "user",
        "content": f"""Use the following template in your response:

{template}

Base the value of each key, in the template, on the following content:

Add an event to my calendar: I am having a dinner with my wife on 3March2024, from 7pm to 9pm, at Central London.  Table booked at: https://letmedoit.ai

Remember, answer in JSON with the filled template ONLY.""",
    },
]

import timeit

for i in range(10):
    print(timeit.timeit(lambda: print(getResponseDict(messages)), number=1))
```

@eliranwong commented on GitHub (Mar 16, 2024):

I found a workaround, thanks.


@marksalpeter commented on GitHub (Apr 23, 2024):

Was there a real resolution to this issue? I'm building a RAG system, and Ollama's JSON formatting has 10x worse performance than regular question/answer retrieval.


@eliranwong commented on GitHub (Apr 23, 2024):

My temporary solution is to switch to llama.cpp when I need JSON output; I am also trying the guidance package to see if faster speeds can be achieved. Ollama is clearly slow with JSON output.


@OnurSarikaya2000 commented on GitHub (May 25, 2024):

I noticed that it is slow when using the /general endpoint but fast when using the /chat endpoint. Maybe this information helps.
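
A rough sketch for reproducing that comparison by hitting both HTTP endpoints directly (assuming `/general` refers to `/api/generate`, and the default local server):

```
# Compare wall-clock time for JSON mode on /api/generate versus /api/chat.
import time
import requests

BASE = "http://localhost:11434"
prompt = "Return a JSON object with keys 'city' and 'country' for Paris."

for path, payload in [
    ("/api/generate", {"model": "mistral", "prompt": prompt,
                       "format": "json", "stream": False}),
    ("/api/chat", {"model": "mistral", "format": "json", "stream": False,
                   "messages": [{"role": "user", "content": prompt}]}),
]:
    start = time.time()
    requests.post(BASE + path, json=payload, timeout=600).raise_for_status()
    print(f"{path}: {time.time() - start:.1f}s")
```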


@habanoz commented on GitHub (May 29, 2024):

I am using Windows. I upgraded Ollama and JSON decoding is not slow anymore.


@igorschlum commented on GitHub (May 29, 2024):

Hi @eliranwong, I know you found a workaround, but if you try the latest version of Ollama, could you tell us whether the issue is solved without the workaround?


@piovis2023 commented on GitHub (Sep 5, 2024):

is this fixed now guys?


@eliranwong commented on GitHub (Sep 5, 2024):

It was particularly slow when I used multiple tools, but I finally found a satisfactory way to run multiple tools in one go. It is now super fast.
In case anyone is interested: https://github.com/eliranwong/freegenius/wiki/Multiple-Tools-in-One-Go
