[GH-ISSUE #4277] Unexpected Increase in Inference Time as Context Window Grows on Llama3:7b #2671

Closed
opened 2026-04-12 13:00:21 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @gusanmaz on GitHub (May 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4277

What is the issue?

I am doing some benchmarks on RAG using the llama3:7b model on Ollama.

I first ask a question directly to the model, then ask the same question along with context from relevant documents, instructing the model to answer based on the given context; essentially, asking the question the RAG way without exceeding the model's context window. As expected, the first query is a single sentence, and the second query is many sentences long.
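Roughly, each direct/RAG benchmark pair looks like the sketch below (a minimal illustration against Ollama's local REST API; the `llama3` tag and the question/context strings are placeholders, not my exact code):

```python
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def timed_generate(model, prompt):
    """Send one non-streaming generate request; return (answer, wall-clock ms)."""
    start = time.monotonic()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"], (time.monotonic() - start) * 1000

question = "When was the Chrysler Building completed?"  # illustrative question
context = "The Chrysler Building is an Art Deco skyscraper in New York City, completed in 1930."

_, direct_ms = timed_generate("llama3", question)
_, rag_ms = timed_generate(
    "llama3",
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}",
)
print(f"direct: {direct_ms:.0f} ms | RAG: {rag_ms:.0f} ms")
```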

I asked 14 questions (14 direct queries plus 14 RAG queries, so 28 queries in total per machine), and the benchmark results can be seen below:

| Machine Type | CPU | RAM (GB) | Graphics Card | OS | Direct Question - Short Context (ms) | RAG Question - Long Context (ms) |
|---|---|---|---|---|---|---|
| Mac Mini | Apple Silicon M2 Pro | 16 | | macOS 14.2.1 | 61152 | 105998 |
| Laptop | AMD Ryzen 9 5900HX (16) @ 4.680GHz | 32 | NVIDIA GeForce RTX 3050 Mobile, AMD ATI Cezanne | Pop!_OS 22.04 LTS | 413264 | 1052304 |
| Desktop | 11th Gen Intel i5-11400F (12) @ 4.400GHz | 64 | NVIDIA GeForce RTX 3060 Lite Hash Rate | Pop!_OS 22.04 LTS | 114599 | 152341 |

I use Ollama version 0.1.34. As far as I know, inference time doesn't change significantly as query context grows for LLMs. I am particularly surprised to see more than a 2.5x increase in inference time on my laptop. I haven't run this benchmark on any model other than Llama3.

I wonder if something is wrong with Ollama or if these benchmark results I am getting are normal.

Thanks!

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.34

GiteaMirror added the bug and performance labels 2026-04-12 13:00:21 -05:00
Author
Owner

@igorschlum commented on GitHub (May 12, 2024):

Hi @gusanmaz, could you provide a link with a script to run, so I can run the benchmark on my Mac? If it's slow on Windows: some improvements were made in the Windows version 0.1.36 of Ollama, so it could be interesting to rerun your tests with this latest version.

Author
Owner

@gusanmaz commented on GitHub (May 12, 2024):

Hi @igorschlum, thank you for your kind help.

The code I'm running is hosted at https://github.com/Jet-Engine/rag_art_deco.

First, you need to execute `indexing.py` to index files for RAG, and then `chat.py` for benchmarking. The benchmark results can be found in the `answers.html` file generated inside the `evaluation` folder.

You may remove `gpt-4` and `groq-llama3-70b` from the line `selected_models = ["gpt-4", "ollama-llama3", "groq-llama3-70b"]` in `chat.py`.
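For example, to benchmark only the local Ollama model, that line would become:

```python
selected_models = ["ollama-llama3"]  # keep only the local Ollama model
```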

The README file, which also serves as a blog post, explains the code in detail, but I wanted to provide a summary of what needs to be done for benchmarking.

Author
Owner

@jessegross commented on GitHub (Oct 23, 2024):

I'm not entirely sure I understand your question. Is it:

  • "Does it take longer to process more data passed in the prompt?" Yes, that is expected - there is prompt processing time.
  • "Should continued conversation take longer for the same length prompt but with more history?" This used to be an issue but should be less of a performance hit in more recent versions of Ollama.

Since both cases should have the expected behavior on current versions of Ollama, I'm going to close this but you can reopen if you are still seeing it.
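One way to tell which of the two is happening: the final non-streaming response from `/api/generate` reports prompt processing and generation time separately. A quick sketch, assuming a local server and the `llama3` tag:

```python
import requests

# One non-streaming request, so the final timing stats arrive in a single JSON object.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=600,
).json()

# Durations are reported in nanoseconds; some versions omit these fields
# when the prompt is served entirely from cache, hence the defaults.
prompt_ms = resp.get("prompt_eval_duration", 0) / 1e6
gen_ms = resp.get("eval_duration", 0) / 1e6
print(f"prompt tokens: {resp.get('prompt_eval_count', 0)}, prompt eval: {prompt_ms:.0f} ms")
print(f"output tokens: {resp.get('eval_count', 0)}, generation: {gen_ms:.0f} ms")
```

A long RAG prompt should show up almost entirely in `prompt_eval_duration`, while `eval_duration` mostly tracks the number of generated tokens.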


Reference: github-starred/ollama#2671