[GH-ISSUE #1232] Streaming output #62662

Closed
opened 2026-05-03 09:54:57 -05:00 by GiteaMirror · 8 comments

Originally created by @r8bywork on GitHub (Nov 22, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1232

React + TS

Hello, how do I output messages as a stream rather than waiting for the entire object to be received?


@easp commented on GitHub (Nov 22, 2023):

Through the API? https://github.com/jmorganca/ollama/blob/main/docs/api.md#generate-a-completion

`stream: true`

I thought that was the default, tho.
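
For reference, the linked endpoint does stream by default: the response body is newline-delimited JSON, one object per token. A minimal `fetch` sketch of the request (run inside an `async` function; the model name is just an example from this thread):

```
// With streaming on (the default), each line of the response body is a
// separate JSON object, e.g. {"response": " llama", "done": false}.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  body: JSON.stringify({ model: "llama2", prompt: "What is a llama?" }),
});
// res.body is a ReadableStream; read it incrementally (see the React
// sketch further down) rather than calling res.json(), which waits
// for the entire reply.
```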


@r8bywork commented on GitHub (Nov 22, 2023):

> Through the API? https://github.com/jmorganca/ollama/blob/main/docs/api.md#generate-a-completion `stream: true` I thought that was the default, tho.

It's just that no matter how I make the request, I always end up waiting for the full object to be returned. The `stream: true | false` flag has no effect on receiving the data as a stream; it always waits for the full object.
Only the structure of the response changes.

Does anyone have an example request?


@BruceMacD commented on GitHub (Nov 22, 2023):

Hey @r8bywork, it sounds like you are having trouble updating the display in JavaScript while the response is being streamed back? Here is some sample TypeScript that should help.

https://github.com/jmorganca/ollamajs/blob/af0bb4eafd72a0587c31d1c31d14c517d05c1cb5/src/index.ts#L40

Then you can call this function and handle each streamed object that is returned like this:

```
import { Ollama } from "ollama";

const ollama = new Ollama();

for await (const token of ollama.generate("llama2", "What is a llama?")) {
  process.stdout.write(token);
}
```
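
One caveat: `process.stdout.write` is a Node API, so this exact loop belongs in a Node script rather than browser code; in a React component you would append each `token` to state instead (see the `fetch`-based sketch after the next comment).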

@r8bywork commented on GitHub (Nov 23, 2023):

Hey @BruceMacD. Thank you for all the help you're giving right now.
I tried this method, but I just get errors after I do `const ollama = new Ollama()`:

`TypeError: Cannot destructure property 'stat' of 'import_node_fs.promises' as it is undefined. at from.js:8:9`

`Module "buffer" has been externalized for browser compatibility. Cannot access "buffer.Blob" in client code. See http://vitejs.dev/guide/troubleshooting.html#module-externalized-for-browser-compatibility for more details.`

`Cannot access "node:fs.promises" in client code. See http://vitejs.dev/guide/troubleshooting.html#module-externalized-for-browser-compatibility for more details.`

![image](https://github.com/jmorganca/ollama/assets/109221341/1cbb4143-d8ff-49c2-8245-9bf7a3aba296)
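
Those errors come from Vite refusing to bundle Node built-ins (`fs`, `buffer`) for the browser, which suggests that build of the client library targets Node rather than client-side code. One dependency-free workaround is to call the HTTP API directly with `fetch` and stream tokens into component state. A rough sketch, with hypothetical component and handler names:

```
import { useState } from "react";

// Rough sketch: stream /api/generate straight into React state so the
// UI updates token by token, with no Node-only client library involved.
export function StreamingAnswer() {
  const [output, setOutput] = useState("");

  async function ask() {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      body: JSON.stringify({ model: "llama2", prompt: "What is a llama?" }),
    });
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    let buffered = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffered += decoder.decode(value, { stream: true });
      const lines = buffered.split("\n");
      buffered = lines.pop() ?? ""; // keep any partial line for the next chunk
      for (const line of lines) {
        if (!line.trim()) continue;
        const chunk = JSON.parse(line);
        setOutput((prev) => prev + (chunk.response ?? ""));
      }
    }
  }

  return (
    <div>
      <button onClick={ask}>Ask</button>
      <pre>{output}</pre>
    </div>
  );
}
```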

@MikeyBeez commented on GitHub (Dec 11, 2023):

I'm having the same problem. When I run `client.py` it doesn't produce output until the entire response has been generated. I would like to see each word as it is generated, the way I can when I run `ollama run mistral` from the command line. In fact, I would like to use the `say` command to speak each word as it is output. I tried doing this using `subprocess`, but that doesn't work either.


@MikeyBeez commented on GitHub (Dec 12, 2023):

I found `python-simplegenerate/client.py`. This logic works perfectly. Here's a program called `conversation.py`. This worked badly using `curl`, but it works fine using the `requests` library. Thank you so much for everything.

```
import argparse
import time
import requests
import json


def generate_response(prompt, context):
    r = requests.post('http://localhost:11434/api/generate',
                      json={
                          'model': 'llama2',
                          'prompt': prompt,
                          'context': context,
                      },
                      stream=True)
    r.raise_for_status()

    for line in r.iter_lines():
        body = json.loads(line)
        response_part = body.get('response', '')
        print(response_part, end='', flush=True)

        if 'error' in body:
            raise Exception(body['error'])

        if body.get('done', False):
            return body['context']


def run_conversation(conversation_duration, initial_prompt):
    context = []  # the context stores a conversation history; you can use this to make the model more context aware
    start_time = time.time()

    while time.time() - start_time < conversation_duration:
        # Agent's turn
        context = generate_response(initial_prompt, context)
        time.sleep(6)

        if time.time() - start_time >= conversation_duration:
            break


if __name__ == "__main__":
    # Get the command-line arguments
    parser = argparse.ArgumentParser(description="Process two arguments.")
    parser.add_argument("--duration", type=int, help="Duration of the conversation in seconds")
    parser.add_argument("--initial_prompt", type=str, help="Initial prompt for the conversation")

    # Parse command-line arguments
    args = parser.parse_args()

    if args.duration is None or args.initial_prompt is None:
        print("Both --duration and --initial_prompt are required.")
    else:
        # Use the arguments in your program logic
        duration = args.duration
        initial_prompt = args.initial_prompt

        run_conversation(duration, initial_prompt)
```
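
If you want to try it, an invocation along these lines should work (the argument values are just examples): `python conversation.py --duration 60 --initial_prompt "What is a llama?"`.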

@MikeyBeez commented on GitHub (Dec 12, 2023):

Sorry about the bad formatting.


@mxyng commented on GitHub (Jan 18, 2024):

This issue seems to be resolved. There's also a new [ollama](https://pypi.org/project/ollama/) Python library you can use.

Reference: github-starred/ollama#62662