[GH-ISSUE #13285] Ollama's responses have about a 4 sec pause between chunks of tokens sent #8780

Closed
opened 2026-04-12 21:32:45 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @nithin-bg on GitHub (Dec 1, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13285

I am using a .NET application to receive tokens from Ollama.

API Request:
EndPoint: "http://localhost:11434/api/generate" - POST
Payload:
{
  "model": "llama3.1:8b",
  "prompt": "Explain French Revolution in detail",
  "stream": true
}
When I trigger this API request on long responses, I get a pause of about 4 seconds after receiving a batch of chunks (roughly 100–150 tokens), then more chunks, then another ~4 second pause, and so on.

Is this how Ollama streams responses, or is something else causing the delay?

Sample Code:

public async IAsyncEnumerable<string> ChatStream(StringContent requestContent, CancellationToken cancellationToken)
{
    var request = new HttpRequestMessage(HttpMethod.Post, _apiEndpoint)
    { Content = requestContent };
    // ResponseHeadersRead starts streaming as soon as headers arrive
    // instead of buffering the whole body.
    using var response = await _httpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead, cancellationToken);
    response.EnsureSuccessStatusCode();
    await using var responseStream = await response.Content.ReadAsStreamAsync(cancellationToken);
    using var reader = new StreamReader(responseStream);
    // ReadLineAsync returns null at end of stream; checking EndOfStream on a
    // network stream can block, so loop on the read result instead.
    string line;
    while ((line = await reader.ReadLineAsync(cancellationToken)) != null)
    {
        if (string.IsNullOrWhiteSpace(line))
            continue;
        Console.WriteLine($"Token received at {DateTime.Now:yyyy-MM-dd HH:mm:ss.ffffffzzz}");
        yield return line;
    }
}
Author
Owner

@rick-github commented on GitHub (Dec 1, 2025):

Does the same thing happen if you use the API directly?

curl http://localhost:11434/api/generate -d "{\"model\":\"llama3.1:8b\",\"prompt\":\"Explain French Revolution in detail\",\"stream\":true}"

What OS?
What GPU?
What CPU?
What version of ollama?

Author
Owner

@nithin-bg commented on GitHub (Dec 2, 2025):

I'm not facing this issue in curl.

OS: Ubuntu 24.04.3 LTS
GPU: NVIDIA H100 NVL (96 GB)
CPU: 314 GB
ollama: 0.12.0

Author
Owner

@rick-github commented on GitHub (Dec 2, 2025):

Seems like an app issue. Is your reader buffering input? What happens if you don't use async calls?
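One way to test the buffering hypothesis above is to bypass `StreamReader` entirely and read raw bytes off the response stream, logging when each chunk arrives. This is a diagnostic sketch, not from the thread, and it assumes the same `responseStream` and `cancellationToken` as in the sample code in the issue body (plus `using System.Text;`). If the ~4 second gaps disappear here, the delay is introduced in the reading layer rather than by Ollama:

```
// Read raw bytes as they arrive, logging arrival time and size of each chunk.
var buffer = new byte[4096];
int bytesRead;
while ((bytesRead = await responseStream.ReadAsync(buffer, cancellationToken)) > 0)
{
    Console.WriteLine($"{bytesRead} bytes at {DateTime.Now:HH:mm:ss.ffffff}");
    Console.Write(Encoding.UTF8.GetString(buffer, 0, bytesRead));
}
```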

Author
Owner

@rick-github commented on GitHub (Dec 3, 2025):

Resolved by [upgrading](https://discord.com/channels/1128867683291627614/1444736133039652965/1445622534015025253) the HTTP version used by HttpClient from 1.1 to 2.0.
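The "1.1 to 2.0" here most likely refers to the HTTP protocol version negotiated by `HttpClient` (which defaults to HTTP/1.1), not a package version. A minimal sketch of requesting HTTP/2 on a .NET 5+ request, assuming the same `requestContent` as in the issue's sample code; `RequestVersionOrLower` lets the client fall back to 1.1 if the server does not support HTTP/2:

```
using System.Net;
using System.Net.Http;

var request = new HttpRequestMessage(HttpMethod.Post, "http://localhost:11434/api/generate")
{
    Version = HttpVersion.Version20,
    // Over cleartext HTTP, HTTP/2 requires prior knowledge; RequestVersionOrLower
    // negotiates down to HTTP/1.1 when HTTP/2 is unavailable.
    VersionPolicy = HttpVersionPolicy.RequestVersionOrLower,
    Content = requestContent
};
```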

Reference: github-starred/ollama#8780