[GH-ISSUE #5602] Running latest version 0.2.1 running slowly and not returning output for long text input #29262

Closed
opened 2026-04-22 07:58:44 -05:00 by GiteaMirror · 4 comments

Originally created by @jillvillany on GitHub (Jul 10, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5602

Originally assigned to: @jmorganca on GitHub.

What is the issue?

I am running ollama on an AWS ml.p3.2xlarge SageMaker notebook instance.

When I install the latest version, 0.2.1, a LangChain chain that runs a name-extraction prompt on a page of text with llama3:latest takes about 8 seconds to respond and returns no names.

However, when I install version 0.1.37, the response time drops to under a second and I get an accurate response listing the people's names found in the text.
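A minimal timing harness for comparing the two versions might look like the sketch below. It calls Ollama's `/api/generate` endpoint directly (default `localhost:11434`) rather than going through LangChain, to isolate server-side latency; the prompt and sample text are hypothetical stand-ins for the reporter's actual page of text.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False so the full completion arrives as a single JSON object.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def time_generate(model: str, prompt: str, url: str = OLLAMA_URL) -> tuple[str, float]:
    """Send one generate request; return (response text, elapsed seconds)."""
    req = urllib.request.Request(
        url,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body.get("response", ""), time.perf_counter() - start

if __name__ == "__main__":
    page_text = "Alice met Bob at the conference."  # substitute the real page of text
    prompt = f"Extract the names of all people mentioned in this text:\n{page_text}"
    reply, secs = time_generate("llama3:latest", prompt)
    print(f"{secs:.2f}s -> {reply!r}")
```

Running the same script against an 0.2.1 install and an 0.1.37 install of the server would show whether the ~8x slowdown is in the model serving path rather than in the LangChain layer.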

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.2.1

GiteaMirror added the performance and bug labels 2026-04-22 07:58:44 -05:00

@jmorganca commented on GitHub (Jul 10, 2024):

Hi there, sorry you’re hitting this. I’m working on a fix. Just to confirm: this instance type has V100 GPUs, right?
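To answer that question from the instance itself, the GPU model can be read from `nvidia-smi`. A small sketch (the parsing helper is mine, not part of any ollama tooling; the expected name assumes the p3.2xlarge's single V100):

```python
import subprocess

def parse_gpu_names(csv_output: str) -> list[str]:
    """Parse the output of `nvidia-smi --query-gpu=name --format=csv,noheader`,
    one GPU name per non-empty line."""
    return [line.strip() for line in csv_output.splitlines() if line.strip()]

def query_gpu_names() -> list[str]:
    """Ask nvidia-smi for the installed GPU model names."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_names(out)

if __name__ == "__main__":
    # On an AWS ml.p3.2xlarge this should report one Tesla V100.
    print(query_gpu_names())
```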


@Demirrr commented on GitHub (Jul 11, 2024):

I have a similar issue with NVIDIA H100 GPUs.

Update: I do not observe the slow inference problem with v0.2.8. Thank you for the release.
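Since the fix apparently landed by v0.2.8, a quick way to check whether a given install is new enough is to compare its reported version string (e.g. the number printed by `ollama --version`) against 0.2.8. A minimal sketch of that comparison:

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Turn a dotted version string like '0.2.8' into a comparable int tuple."""
    return tuple(int(part) for part in v.split("."))

def at_least(installed: str, minimum: str) -> bool:
    """True if the installed version is at or above the minimum.
    Tuple comparison handles multi-digit components ('0.10.0' > '0.2.8')."""
    return parse_version(installed) >= parse_version(minimum)

# The reporter saw the slowdown on 0.2.1 but not on 0.2.8:
print(at_least("0.2.1", "0.2.8"))  # False -> affected range
print(at_least("0.2.8", "0.2.8"))  # True  -> includes the fix
```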


@rhelenagh commented on GitHub (Jul 14, 2024):

I have the same situation, with a Windows 11 operating system.


@jmorganca commented on GitHub (Jun 19, 2025):

This should be fixed now – there was an issue last year with V100 GPUs that should be resolved. If not, feel free to let me know and I can reopen!


Reference: github-starred/ollama#29262