[GH-ISSUE #254] Streaming llama output #46616

Closed
opened 2026-04-27 23:13:32 -05:00 by GiteaMirror · 4 comments

Originally created by @osamanatouf2 on GitHub (Aug 1, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/254

@jmorganca @mxyng Testing llama2 output after building and serving the model, with the example prompt "why the sky is blue", I noticed that streaming does not work correctly. Using either curl or Python requests on Ubuntu 22.04, I got something like:
{"model":"llama2","created_at":"2023-08-01T22:02:11.894420812Z","response":" sc","done":false}
{"model":"llama2","created_at":"2023-08-01T22:02:12.151293915Z","response":"at","done":false}
{"model":"llama2","created_at":"2023-08-01T22:02:12.409555353Z","response":"ters","done":false}

Is there any way to disable streaming?


@mxyng commented on GitHub (Aug 2, 2023):

@osamanatouf2 this is working as expected. Each stream response is a complete message to be consumed independently. If you're only interested in the full response, you can buffer the stream responses until done. For an example of this using Python requests, check the linked Discord message.

The short version:

import json

import requests


def unstream(model, prompt):
    # Stream tokens from the generate endpoint and buffer them until done.
    r = requests.post('http://localhost:11434/api/generate',
                      json={'model': model, 'prompt': prompt},
                      stream=True)
    r.raise_for_status()
    response = ''
    for line in r.iter_lines():
        body = json.loads(line)
        response += body.get('response', '')
        if body.get('done', False):
            # Return the final message with the full, concatenated response.
            body['response'] = response
            return body

https://discord.com/channels/1128867683291627614/1128867684130508875/1134313869221838868
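
For illustration, a possible call of the helper above (assuming the Ollama server is running locally on the default port and the llama2 model has been pulled) could look like:

# Hypothetical usage of the unstream() helper above; the model name and
# prompt are placeholders.
result = unstream('llama2', 'why is the sky blue')
print(result['response'])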


@osamanatouf2 commented on GitHub (Aug 2, 2023):

@mxyng My point is that if I want to use a stream buffer, I cannot, since the words are split. The issue shows one example, but there are more. I did try a workaround to get the full response, and that seems to be working, but I could not get streaming itself to work properly. I am using Python. I always end up with output like the following:
Great question, this is due to phenomeno n that caused by Ray le igh sc at tering


@mxyng commented on GitHub (Aug 2, 2023):

Can you provide an example of your Python code? If the output is not being reconstructed properly, it is caused either by the model producing mangled output or by the reconstruction step. The stream is only a means of passing the decoded tokens from the model to the client.

Keep in mind that spacing is handled by the model. The full response must be joined without inserting any additional characters, e.g. ''.join(responses)
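
A minimal sketch of that approach, assuming the server is running locally on the default port, with llama2 and the prompt used only as placeholders: print each piece as it arrives and also collect the pieces so the full text can be rebuilt with ''.join().

import json

import requests

# Consume the stream correctly: emit each piece verbatim and keep it for later.
responses = []
with requests.post('http://localhost:11434/api/generate',
                   json={'model': 'llama2', 'prompt': 'why is the sky blue'},
                   stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        body = json.loads(line)
        piece = body.get('response', '')
        responses.append(piece)
        print(piece, end='', flush=True)  # no extra spaces or newlines added
        if body.get('done', False):
            break

full_text = ''.join(responses)  # join without separators; the model handles spacing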


@osamanatouf2 commented on GitHub (Aug 2, 2023):

@mxyng you are correct. ''.join(responses) does actually solve the issue.
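
To illustrate why the empty separator matters, here is a small sketch; the pieces below are placeholders modeled on the fragments quoted above:

pieces = ['Ray', 'le', 'igh', ' sc', 'at', 'ter', 'ing']
print(' '.join(pieces))  # "Ray le igh  sc at ter ing" -- words broken apart
print(''.join(pieces))   # "Rayleigh scattering"       -- model's own spacing preserved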


Reference: github-starred/ollama#46616