[GH-ISSUE #6002] JSON Schema conformity using Llama.cpp Grammar generation for Tool Calling #65790

Closed
opened 2026-05-03 22:42:13 -05:00 by GiteaMirror · 17 comments

Originally created by @marcnnn on GitHub (Jul 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6002

Originally assigned to: @ParthSareen on GitHub.

First, thanks to the ollama team; it's a pleasure to use!

I was looking into the topic of schemas and grammars from the tools perspective.

I assume:

  • the tool arguments JSON schema is only inserted into the prompt by ollama
  • no grammar is used to enforce the JSON schema for tool arguments (only the generic JSON grammar is used)
  • the JSON schema is not used for validation after parsing either

Since multiple pull requests on the topic of JSON schemas and grammars are open, I would like to address that by suggesting closing them.
For the few people who actually need raw grammar support, using llama.cpp without ollama seems appropriate, since using grammars directly is already very low level.

ollama could instead focus on supporting tool calling with JSON Schema conformity.

One would need to combine all possible tools into one JSON Schema,
a bit like in this example: https://github.com/ggerganov/llama.cpp/issues/7703
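
For illustration, a minimal sketch of what that combination could look like, assuming OpenAI-style tool definitions; `combine_tools_into_schema` is a made-up helper, not an ollama API:

```python
# Hypothetical sketch: merge several tool definitions into one JSON Schema
# whose anyOf branches each pin a tool name to its argument schema.
def combine_tools_into_schema(tools: list[dict]) -> dict:
    branches = []
    for tool in tools:
        fn = tool["function"]  # OpenAI-style tool definition (assumption)
        branches.append({
            "type": "object",
            "properties": {
                "name": {"const": fn["name"]},
                "arguments": fn["parameters"],
            },
            "required": ["name", "arguments"],
        })
    return {"anyOf": branches}
```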

Then the model can be constrained to comply with the correct tool use.
The problem is that the exact tool-call output format depends on the model:

https://github.com/ollama/ollama/blob/f5e3939220e9cd3d7a636708bc9df031ebfd4854/server/testdata/tools/llama3-groq-tool-use.out#L14

vs

https://github.com/ollama/ollama/blob/f5e3939220e9cd3d7a636708bc9df031ebfd4854/server/testdata/tools/firefunction.out#L17

vs

https://github.com/meta-llama/llama-models/blob/e51c73ac639a38877da9bdfaecb4cb07dc8ba6d0/models/llama3_1/api/tool_utils.py#L16

How to deal with:

  • "functools" in firefunction
  • the "<tool_call>" token in llama3-groq
  • "<function=****>" for llama 3.1

is not clear to me yet, since I think it would be best to leverage the JSON-Schema-to-grammar functionality in llama.cpp (a sketch of the idea follows).
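
To make that concrete, a rough sketch: convert the tool schema to a grammar, then wrap it in the model-specific markers. Everything here is hypothetical; `json_schema_to_gbnf` stands in for llama.cpp's JSON-schema-to-grammar conversion (there is no such Python binding), and the wrapper strings are only loosely taken from the files linked above:

```python
def json_schema_to_gbnf(schema: dict) -> str:
    # Stand-in for llama.cpp's JSON-schema-to-grammar conversion;
    # returns a trivial rule so the sketch runs end to end.
    return 'schema-root ::= "{" [^}]* "}"'

# (prefix, suffix) markers per model family -- assumptions, not verified.
WRAPPERS = {
    "llama3-groq-tool-use": ("<tool_call>", "</tool_call>"),
    "firefunction": (" functools", ""),
}

def tool_call_grammar(model: str, schema: dict) -> str:
    prefix, suffix = WRAPPERS[model]
    body = json_schema_to_gbnf(schema)
    # Quote the literal markers in GBNF and splice in the schema grammar.
    return f'root ::= "{prefix}" schema-root "{suffix}"\n{body}'
```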

@mxyng What do you think about that?

Should I spend time working on that, or is there no chance it would be merged into ollama?

This should address:

https://github.com/ollama/ollama/issues/5976
https://github.com/ollama/ollama/issues/1507
https://github.com/ollama/ollama/pull/5348
https://github.com/ollama/ollama/pull/830
https://github.com/ollama/ollama/pull/565
https://github.com/ollama/ollama/pull/1606
https://github.com/ollama/ollama/pull/4525
https://github.com/ollama/ollama/pull/2404

on the consumer side:
https://github.com/thmsmlr/instructor_ex/issues/11


@NeuralNotwerk commented on GitHub (Jul 30, 2024):

I'd love to see this feature added. I don't understand why it isn't just passed straight through to ggml via llama.cpp. We need arbitrary grammars in GBNF format.


@Kinglord commented on GitHub (Aug 7, 2024):

Hey all, I know there's an automated ping here, but just to better align everyone, please check out and comment on my new call to the Ollama team for clarity. As always, please be civil and stay on topic! 😄 - https://github.com/ollama/ollama/issues/6237


@marcnnn commented on GitHub (Aug 7, 2024):

@Kinglord Thanks for creating a place for that discussion.
I tried to get around that discussion here.

Tool use with models that are trained specifically for that, like llama3.1,
comes with its own challenges for the grammar:
just using the JSON-schema-to-grammar conversion in llama.cpp will not be enough, as explained above.

A way to template the grammar generation in the model file could be a solution that I am thinking about.


@ParthSareen commented on GitHub (Dec 5, 2024):

Hi! Going to close this out as we're supporting structured outputs through https://github.com/ollama/ollama/pull/7900

Left a comment with some background as well: https://github.com/ollama/ollama/issues/6237#issuecomment-2518836758


@allenporter commented on GitHub (Dec 9, 2024):

@ParthSareen I interpreted this request as actually using the provided tool-calling schema for the grammar, which I believe is slightly different from supporting structured outputs in the response. Did this change actually support grammars for tool calls? I am assuming not, because it appears to pass in the request format only, without looking at the tool format.

I want to make sure this was intentional and not a misunderstanding. Having tool calls follow the schema would be a huge quality win for smaller models that don't always produce correct tool-call outputs.


@ParthSareen commented on GitHub (Dec 9, 2024):

Thanks for the ping @allenporter. Seems like I misinterpreted on my first run-through of this. Reopening the issue for now. Going to think a bit about how we can support this, what extensibility looks like, and whether it makes sense for the stage of the project we're in. I'm also not sure what the exact interface would look like, but am going to think through this one. I do think this could be really cool and improve accuracy. Just a bit worried about the interface, as there are big updates to the engine incoming.

Will keep you all posted! Thanks!


@allenporter commented on GitHub (Dec 9, 2024):

Ok thanks, I'm happy to help contribute (I'm familiar with some of how the tool parsing code works), but I also know it's a bit tricky, since it depends on model-specific formats as described above by the reporter. And as you say, if the tech direction is shifting here, that adds a bit more to the requirements...

Happy to keep discussing.

Just to motivate this a bit: the primary use case I'm pushing for is to improve tool-use quality for Home Assistant device control, where llama3.1 sometimes gets the tool params wrong. (We have some benchmarks tracking this.)


@ParthSareen commented on GitHub (Dec 9, 2024):

@allenporter If you'd like, you can take a crack at it for fun and see how far you get. There are a couple of things I need to get to in the meantime, but I can pick it up from wherever you leave off. We can coauthor it if you're interested 😄 For a starting point I'd dig into the ChatHandler around here: https://github.com/ollama/ollama/blob/da09488fbfc437c55a94bc5374b0850d935ea09f/server/routes.go#L1467-L1530

We're also currently using go templates for the tool parsing - something that I potentially want to refactor too. Would be out of scope for the PR but important to keep in mind with whatever you prototype. If you do choose to pick this up just open a draft PR and tag me!

Thanks!


@allenporter commented on GitHub (Dec 10, 2024):

> We're also currently using go templates for the tool parsing - something that I potentially want to refactor too.

Yeah, when I was looking at this, it seemed a little difficult to specify up front as a grammar since:
(1) the tool response format depends on the model / template
(2) tool responses are optional

My impression is the flow is something like this to prepare for a tool call:

  • https://github.com/ollama/ollama/blob/900f64e6be859f52350c25032ff5b11f10509c7e/server/prompt.go#L25 appends any present tool calls into the request
  • The completion request point you cited will have the tool calls in the request
  • The response is parsed at https://github.com/ollama/ollama/blob/900f64e6be859f52350c25032ff5b11f10509c7e/server/model.go#L303 using the template.
  • Response parsing is a little more creative. It assumes that the output must contain json inside whatever is specified in the template (works for most models, but not all):
    • It instantiates the template with a fake tool call using placeholder variable names
    • It then brute-force parses the output as json at https://github.com/ollama/ollama/blob/900f64e6be859f52350c25032ff5b11f10509c7e/server/model.go#L277, iterating through each character and parsing objects as it goes
    • Then it reverse-engineers which fields contain the tool names and arguments from the placeholder names
    • Then it repeats the same process for the real output, collecting the tool call objects (see the sketch after this list)
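
A minimal Python sketch of just the brute-force JSON scan, to illustrate the idea (the real implementation is Go in server/model.go; this helper name is made up):

```python
import json

def scan_json_objects(text: str) -> list:
    """Attempt to decode a JSON object at every '{' in the text, skipping
    past any span that parses successfully -- mirrors the character-by-
    character scan described above."""
    decoder = json.JSONDecoder()
    objects, i = [], 0
    while i < len(text):
        if text[i] == "{":
            try:
                obj, end = decoder.raw_decode(text[i:])
                objects.append(obj)
                i += end  # jump past the parsed object
                continue
            except ValueError:
                pass  # not valid JSON starting here; keep scanning
        i += 1
    return objects

# e.g. scan_json_objects('<tool_call>{"name": "f", "arguments": {"x": 1}}')
# -> [{'name': 'f', 'arguments': {'x': 1}}]
```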

Proposed approach to get started:

  • Maybe we start simple with a model like llama3, which has a very straightforward format: either it's responding with text, or it's responding with json, with no other wrapping characters.
  • Define a grammar where the structure is optional: unless it starts producing json it's free text, and once it does, it must follow a schema where the `parameters` match the tool calls in the request (see the grammar sketch after this list).
  • It looks possible to even try this out with the current request API by making assumptions about the model format, except for the part where it's optional
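
Roughly, the top of such a grammar might look like this; a hand-written GBNF sketch (embedded in a Python string for readability), not output from any existing converter, with the schema-derived rules elided:

```python
# GBNF sketch where a tool call is one *option*, not mandatory: the model
# may emit free text, or a JSON object pinned to the tool's schema.
# "tool-args" would be generated from the tool's JSON schema (elided here).
OPTIONAL_TOOL_GRAMMAR = r'''
root      ::= text | tool-call
text      ::= [^{]+
tool-call ::= "{" ws "\"name\"" ws ":" ws "\"get_weather\"" ws ","
              ws "\"parameters\"" ws ":" ws tool-args ws "}"
ws        ::= [ \t\n]*
'''
```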

@ParthSareen commented on GitHub (Dec 13, 2024):

@allenporter I think this sounds like a good start - would be cool to get a prototype working! Keep me posted!


@allenporter commented on GitHub (Dec 13, 2024):

I made a simple notebook to call `llama3.1` w/ tools while also passing the json schema of the tools to the `format` parameter as an `anyOf`:
https://github.com/allenporter/ml-papers/blob/main/function-calling/ollama/json-schema.ipynb -- the notebook generates the schema by iterating over the tools.

It works well (it's easier to tell if it's working by changing around the tools and mismatching the questions, since you can see it force tool calls it would not otherwise make).

This does make tool calling mandatory when using this approach. Llama generally biases towards calling tools when they are provided, but when you apply the format it will *always* call tools, which is not always desired.
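
For reference, the shape of that experiment as a hedged sketch using the ollama Python client; the tool, question, and schema construction here are illustrative, not taken from the notebook:

```python
import ollama

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# One anyOf branch per tool, pinning the name to its argument schema.
schema = {
    "anyOf": [{
        "type": "object",
        "properties": {
            "name": {"const": "get_weather"},
            "arguments": weather_tool["function"]["parameters"],
        },
        "required": ["name", "arguments"],
    }]
}

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[weather_tool],
    format=schema,  # constrain the raw output to the combined tool schema
)
print(response["message"]["content"])
```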


@allenporter commented on GitHub (Dec 13, 2024):

I've applied this technique to the Home Assistant `assist-mini` [benchmark](https://github.com/allenporter/home-assistant-datasets/tree/main/reports) using the naive approach above, and `llama3.1:8b` appears to improve from 83% to 91% before/after, confirming that using a tool schema is a worthwhile improvement to explore.


@allenporter commented on GitHub (Dec 13, 2024):

As a next step, I'll try to:

  • get a grammar working that forces a json schema inferred from the tool schema (always)
  • make tools optional in the grammar
  • see if it's possible to create a grammar from the template output

@davidgeorgewilliams commented on GitHub (Dec 13, 2024):

+1, structured outputs from tool calls would be extremely helpful for adoption.


@ParthSareen commented on GitHub (Dec 13, 2024):

The preliminary results look promising @allenporter - the next steps sound good too. Just toss whatever work you get done on a draft PR and tag me. Thanks for working on this! 🙏🏽


@allenporter commented on GitHub (Feb 3, 2025):

The native tool calling support in `llama.cpp` implements this feature: https://github.com/ggerganov/llama.cpp/pull/9639 Quoting:

> Any tool_calls field returned by llama-server should always conform to the JSON schema (to the extent that it uses [supported features of JSON schemas](https://github.com/ggerganov/llama.cpp/tree/master/grammars#json-schemas--gbnf)), so there's no need to use any post-processor.

I could see a world where ollama instead uses the native tool calling to get this feature, assuming template support in llama.cpp can match what ollama has.


@ParthSareen commented on GitHub (Feb 3, 2025):

@allenporter We're working on a new engine and I've been working on sampling, so we won't be directly using llama.cpp for that. We're already doing partial JSON parsing on the tools output, so it makes sense that we'll derive the json schema from the tools. I think we can close this issue out for now, and I'll open something for myself to get to after we have some of the new engine work merged in :) Really appreciate you digging in and validating this!
