[GH-ISSUE #6707] Generate endpoint intermittently misses final token before done #4223

Closed
opened 2026-04-12 15:09:31 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @tarbard on GitHub (Sep 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6707

Originally assigned to: @jessegross on GitHub.

What is the issue?

When using the generate endpoint, it intermittently misses the last token right before the "done" message:

{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.463348938Z","response":" Bear","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.475993178Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.488651949Z","response":" Elephant","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.50131158Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.51400078Z","response":" Gor","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.539481043Z","response":"","done":true,"done_reason":"stop","total_duration":8790953777,"load_duration":8080650494,"

In the above example, the token that should complete "Gorilla" is not emitted before the done response, so we just get "Gor".

Here's the curl command to reproduce this:

curl -H 'Host: 127.0.0.1:11434' -H 'Content-Type: application/json' -H 'Connection: Keep-Alive' --compressed -H 'Accept-Language: en-GB,*' -H 'User-Agent: Mozilla/5.0' -X POST http://127.0.0.1:11434/api/generate -d '{"model": "adrienbrault/nous-hermes2theta-llama3-8b:q8_0", "prompt": "\n<|im_start|>user\nYou will think of a number. Then you will list that many animals. Do not write any other words only the animal. Be terse in your response.<|im_end|>\n<|im_start|>assistant", "raw": true, "stream": true, "keep_alive": -1, "options": {"seed": 99, "num_predict": 1024, "num_ctx": 4096, "stop": ["<end>", "user:", "assistant:"], "num_batch": 1, "temperature": 0.5, "top_k": 40, "top_p": 0.9}}'

I have only seen this with one model so far (adrienbrault/nous-hermes2theta-llama3-8b:q8_0), so the model may well be a factor. However, I don't get this problem with the chat endpoint for that model, but I do get it with the generate endpoint. I'm using raw mode and stream=true.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.3.9

GiteaMirror added the nvidia and bug labels 2026-04-12 15:09:31 -05:00
Author
Owner

@pdevine commented on GitHub (Sep 10, 2024):

I tried this on metal a dozen times and it was consistent:

{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.362725Z","response":"\n","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.389566Z","response":"5","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.415809Z","response":":","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.442209Z","response":" lion","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.468548Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.494573Z","response":" elephant","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.520736Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.546917Z","response":" gir","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.573029Z","response":"affe","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.599043Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.625321Z","response":" kang","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.651604Z","response":"aroo","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.677765Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.704145Z","response":" p","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.730185Z","response":"enguin","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:26:37.756331Z","response":"","done":true,"done_reason":"stop","total_duration":728805250,"load_duration":18052916,"prompt_eval_count":40,"prompt_eval_duration":316473000,"eval_count":16,"eval_duration":393564000}

@pdevine commented on GitHub (Sep 10, 2024):

On linux+nvidia I was getting the same results as you; however, when I used the model template instead of raw mode I got better results:

$ curl -H 'Host: 127.0.0.1:11434' -H 'Content-Type: application/json' -H 'Connection: Keep-Alive' --compressed -H 'Accept-Language: en-GB,*' -H 'User-Agent: Mozilla/5.0' -X POST http://127.0.0.1:11434/api/generate -d '{"model": "adrienbrault/nous-hermes2theta-llama3-8b:q8_0", "prompt": "You will think of a number. Then you will list that many animals. Do not write any other words only the animal. Be terse in your response.", "stream": true, "options": {"seed": 99, "num_predict": 2048, "num_ctx": 4096, "temperature": 0.5, "top_k": 40, "top_p": 0.9}}'
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.251616238Z","response":"\n","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.251619327Z","response":"I","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.251656504Z","response":"'ve","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.259349398Z","response":" thought","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.270200376Z","response":" of","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.281057645Z","response":" ","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.2919001Z","response":"5","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.302748052Z","response":" numbers","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.313607486Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.32445978Z","response":" here","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.335295039Z","response":" are","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.346160415Z","response":" the","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.35697161Z","response":" corresponding","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.367712403Z","response":" animals","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.378437554Z","response":":\n\n","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.389163702Z","response":"1","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.399910268Z","response":".","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.410662594Z","response":" Tiger","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.42140644Z","response":"\n","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.432155321Z","response":"2","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.442912517Z","response":".","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.45366121Z","response":" Elephant","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.464429939Z","response":"\n","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.475180555Z","response":"3","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.485949212Z","response":".","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.496731734Z","response":" Kang","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.508470783Z","response":"aroo","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.519244652Z","response":"\n","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.530008305Z","response":"4","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.540776657Z","response":".","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.551546089Z","response":" Penguin","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.562318564Z","response":"\n","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.573073622Z","response":"5","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.583847574Z","response":".","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.594623831Z","response":" Gor","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.605408025Z","response":"illa","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-10T00:39:37.616216024Z","response":"","done":true,"done_reason":"stop","context":[128002,882,198,2675,690,1781,315,264,1396,13,5112,499,690,1160,430,1690,10099,13,3234,539,3350,904,1023,4339,1193,279,10065,13,2893,51637,304,701,2077,13,128003,198,128002,78191,198,40,3077,3463,315,220,20,5219,11,1618,527,279,12435,10099,1473,16,13,36845,198,17,13,79189,198,18,13,55376,76865,198,19,13,71244,198,20,13,47247,6374],"total_duration":483583781,"load_duration":31263742,"prompt_eval_count":39,"prompt_eval_duration":17975000,"eval_count":37,"eval_duration":389414000}

@tarbard commented on GitHub (Sep 10, 2024):

Thanks for looking into it. By the way, I noticed it was often the same tokens that were missing: the end of Hyena, Zebra (it usually just gave "Z" and missed the "ebra"), and the Gorilla example above. This made me wonder if the way that particular model's tokens or vocabulary were defined contributes to the problem.

The fact that it works when using the model template might also explain why chat mode didn't seem to give the same problem, since that was using the built-in template; my software does require raw mode, though. I've not seen this yet on a normal llama3 model, but I haven't had a chance to test it in large numbers yet.


@pdevine commented on GitHub (Sep 10, 2024):

@tarbard I would assume that it's something w/ the template which is causing the problem, but it's concerning that it works fine on metal and only seems to be impacting nvidia. I didn't get a chance to try it w/ rocm though. My guess is that getting the template correct would help, but there's probably something lurking in the nvidia inference code.


@jessegross commented on GitHub (Sep 11, 2024):

It turns out this is not hardware-dependent; we just get different output on different hardware, and some outputs do not trigger the issue. If we force it to produce the same problematic output:

curl -H 'Host: 127.0.0.1:11434' -H 'Content-Type: application/json' -H 'Connection: Keep-Alive' --compressed -H 'Accept-Language: en-GB,*' -H 'User-Agent: Mozilla/5.0' -X POST http://127.0.0.1:11434/api/generate -d '{"model": "adrienbrault/nous-hermes2theta-llama3-8b:q8_0", "prompt": "\n<|im_start|>user\nWhat is a large ape? Do not write any other words only the animal. Be terse in your response.<|im_end|>\n<|im_start|>assistant", "raw": true, "stream": true, "keep_alive": -1, "options": {"seed": 99, "num_predict": 1024, "num_ctx": 4096, "stop": ["<end>", "user:", "assistant:"], "num_batch": 1, "temperature": 0.0, "top_k": 40, "top_p": 0.9}}'

Metal:

{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-11T23:07:59.814703Z","response":"\n\n","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-11T23:07:59.840429Z","response":"G","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-11T23:07:59.866192Z","response":"or","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-11T23:07:59.918232Z","response":"","done":true,"done_reason":"stop","total_duration":1888076459,"load_duration":1051646334,"prompt_eval_count":31,"prompt_eval_duration":730217000,"eval_count":5,"eval_duration":103643000}

Nvidia:

{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-11T23:08:55.018737321Z","response":"\n\n","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-11T23:08:55.029257868Z","response":"G","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-11T23:08:55.04160876Z","response":"or","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-11T23:08:55.062614239Z","response":"","done":true,"done_reason":"stop","total_duration":2131498867,"load_duration":1753848975,"prompt_eval_count":31,"prompt_eval_duration":284311000,"eval_count":5,"eval_duration":43930000}

With this input, outputs that end in 'a' (Gorilla, Hyena, Zebra) match the first letter of the stop sequence 'assistant:'. We hold onto the token until the next one arrives to see if it is part of a potential stop match that we should not send back. However, if the next token is EOG, we end without sending the pending responses, which is what the patch above addresses.

The model template uses different stop tokens, which is why the problem does not occur there.
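The buffering behavior described above can be sketched as follows. This is a simplified, character-level Python illustration (not Ollama's actual Go implementation): text that could be the start of a stop sequence is held back, and the fix is to flush whatever is still held when generation ends.

```python
def stream_with_stops(tokens, stops):
    """Yield text pieces from a token stream, suppressing stop sequences.

    A suffix of the output that matches a prefix of any stop sequence is
    held back until more text arrives. When the stream ends (EOG), the
    held-back text is flushed; the bug was ending without this flush.
    """
    pending = ""
    for tok in tokens:
        pending += tok
        # A complete stop sequence: emit the text before it and halt.
        for s in stops:
            i = pending.find(s)
            if i != -1:
                if pending[:i]:
                    yield pending[:i]
                return
        # Hold back the longest suffix that is a prefix of some stop.
        hold = 0
        for s in stops:
            for n in range(min(len(s), len(pending)), 0, -1):
                if pending.endswith(s[:n]):
                    hold = max(hold, n)
                    break
        if len(pending) > hold:
            yield pending[: len(pending) - hold]
            pending = pending[len(pending) - hold :]
    # End of generation: flush anything still pending (the fix).
    if pending:
        yield pending
```

With the stop sequence "assistant:", the trailing "a" of "Gorilla" is held back as a possible match; without the final flush, the output would be truncated exactly as reported.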


@tarbard commented on GitHub (Sep 14, 2024):

@jessegross amazing, thanks for your work on this.


Reference: github-starred/ollama#4223