[GH-ISSUE #3995] Issues with Llama3:70b Model When stream is Set to False #64510

Closed
opened 2026-05-03 17:55:41 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @JIAQIA on GitHub (Apr 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3995

What is the issue?

When the stream parameter is set to True, the Llama3:70b model functions correctly. Here is the pytest code used for testing streaming output:

import httpx


def test_ollama3_70b_original_stream_completion() -> None:
    """Test the local Ollama API - streaming output."""
    host = "http://localhost"
    port = 11434
    client = httpx.Client(base_url=f"{host}:{port}")
    json_info = {
        "model": "llama3:70b",
        "messages": [
            {
                "role": "user",
                "content": "Help me make a travel guide to Yosemite National Park in 200 words?",
            }
        ],
        "stream": True,
    }
    with client.stream("POST", "/api/chat", json=json_info) as response:  # noqa
        for line in response.iter_lines():
            print(line)
            print("-" * 10)

For the standard Llama3 model (not 70b!) with stream set to False, the API works as expected:

def test_ollama3_chat_completion() -> None:
    """Test the local Ollama API - regular (non-streaming) output."""
    host = "http://localhost"
    port = 11434
    client = httpx.Client(base_url=f"{host}:{port}")
    json_info = {
        "model": "llama3",
        "messages": [
            {
                "role": "user",
                "content": "Help me make a travel guide to Yosemite National Park in 200 words?",
            }
        ],
        "stream": False,
    }
    response = client.post("/api/chat", json=json_info)
    print(response.json())

However, when using the Llama3:70b model with stream set to False, the request times out, resulting in an httpx.ReadTimeout exception:

def test_ollama3_70b_chat_completion() -> None:
    """Test the local Ollama llama3:70b API - regular (non-streaming) output."""
    host = "http://localhost"
    port = 11434
    client = httpx.Client(base_url=f"{host}:{port}")
    json_info = {
        "model": "llama3:70b",
        "messages": [
            {
                "role": "user",
                "content": "Help me make a travel guide to Yosemite National Park in 200 words?",
            }
        ],
        "stream": False,
    }
    response = client.post("/api/chat", json=json_info)
    print(response.json())
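
For completeness, a small diagnostic sketch (not part of the original report) that distinguishes a read timeout from a connect timeout; httpx uses a 5-second timeout by default unless one is set explicitly, and the helper name below is illustrative:

def diagnose_ollama_timeout() -> None:
    """Hypothetical helper: report which httpx timeout fires for the non-streaming call."""
    client = httpx.Client(base_url="http://localhost:11434")  # default httpx timeout is 5s
    json_info = {
        "model": "llama3:70b",
        "messages": [
            {
                "role": "user",
                "content": "Help me make a travel guide to Yosemite National Park in 200 words?",
            }
        ],
        "stream": False,
    }
    try:
        response = client.post("/api/chat", json=json_info)
        print(response.json())
    except httpx.ReadTimeout:
        # The server accepted the request but did not return the body within the read timeout.
        print("ReadTimeout: the response took longer than the client's read timeout")
    except httpx.ConnectTimeout:
        print("ConnectTimeout: the Ollama server could not be reached at all")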

FYI
Hardware:

Hardware Overview:

Model Name: Mac Studio
Model Identifier: Mac14,14
Model Number: Z1800001HCH/A
Chip: Apple M2 Ultra
Total Number of Cores: 24 (16 performance and 8 efficiency)
Memory: 192 GB
System Firmware Version: 8422.141.2
OS Loader Version: 8422.141.2
Serial Number (system): G2P4927DXG
Hardware UUID: D9B7B9C2-D50B-5EC6-A5BE-D6403AF1ADA2
Provisioning UDID: 00006022-000451D90E90A01E
Activation Lock Status: Enabled

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.1.32

GiteaMirror added the bug label 2026-05-03 17:55:41 -05:00
Author
Owner

@BruceMacD commented on GitHub (May 1, 2024):

How long does it take to generate a response in the streaming case (from start to finish)? It looks possible to me that the request is timing out due to the amount of time it takes to load the model, generate the response, and close the connection. If possible, try increasing the timeout on your httpx client; that could help test this.

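One way to answer that question, sketched against the streaming test from the issue body (the timing wrapper and function name are illustrative, not from the thread):

import time

import httpx


def time_llama3_70b_streaming() -> None:
    """Hypothetical helper: time the llama3:70b streaming request from start to finish."""
    client = httpx.Client(base_url="http://localhost:11434")
    json_info = {
        "model": "llama3:70b",
        "messages": [
            {
                "role": "user",
                "content": "Help me make a travel guide to Yosemite National Park in 200 words?",
            }
        ],
        "stream": True,
    }
    start = time.monotonic()
    with client.stream("POST", "/api/chat", json=json_info) as response:
        for _ in response.iter_lines():
            pass
    print(f"streaming request took {time.monotonic() - start:.1f}s")
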
Author
Owner

@JIAQIA commented on GitHub (May 6, 2024):

@BruceMacD Very quickly, about 5s in streaming mode.

Author
Owner

@jmorganca commented on GitHub (May 9, 2024):

Make sure to set a large timeout, as it might take a while to generate the full response, e.g. 30 minutes (which might be longer than needed):

client = httpx.Client(timeout=1800.0)
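
Applied to the failing non-streaming test from the issue body, that suggestion would look roughly like this (a sketch only; the 1800-second value is the example above, not a confirmed fix):

import httpx

# Same non-streaming llama3:70b request, with the suggested 30-minute timeout on the client.
client = httpx.Client(base_url="http://localhost:11434", timeout=1800.0)
response = client.post(
    "/api/chat",
    json={
        "model": "llama3:70b",
        "messages": [
            {
                "role": "user",
                "content": "Help me make a travel guide to Yosemite National Park in 200 words?",
            }
        ],
        "stream": False,
    },
)
print(response.json())
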
Author
Owner

@JIAQIA commented on GitHub (May 10, 2024):

I have set the timeout to 1800 seconds, and it still failed.


Reference: github-starred/ollama#64510