[GH-ISSUE #10137] Performance is terrible #6651

Closed
opened 2026-04-12 18:20:40 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @khteh on GitHub (Apr 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10137

Both on bare metal and in a Docker container running in local k8s, this is what I see from running a pytest of a LangChain RAG application:

```
'response_metadata': {'model': 'llama3.3', 'created_at': '2025-04-05T07:27:29.787284732Z', 'done': True, 'done_reason': 'stop', 'total_duration': 368677640273, 'load_duration': 23849621, 'prompt_eval_count': 1229, 'prompt_eval_duration': 31958590162, 'eval_count': 242, 'eval_duration': 336691893036,
```
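
Those durations are reported in nanoseconds, so the effective throughput can be read straight off the metadata; a minimal sketch of the arithmetic, using only the field values shown above:

```
# Throughput derived from the response_metadata above (durations are in nanoseconds).
metadata = {
    "total_duration": 368_677_640_273,       # ~368.7 s for the whole request
    "prompt_eval_count": 1229,
    "prompt_eval_duration": 31_958_590_162,  # ~32.0 s
    "eval_count": 242,
    "eval_duration": 336_691_893_036,        # ~336.7 s
}

NS = 1e9
prompt_tps = metadata["prompt_eval_count"] / (metadata["prompt_eval_duration"] / NS)
gen_tps = metadata["eval_count"] / (metadata["eval_duration"] / NS)

print(f"prompt eval: {prompt_tps:.1f} tokens/s")   # ~38.5 tokens/s
print(f"generation:  {gen_tps:.2f} tokens/s")      # ~0.72 tokens/s
print(f"total:       {metadata['total_duration'] / NS:.0f} s per request")  # ~369 s
```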

At the end of the pytest run:

```
=============================================================================== 2 passed in 2506.38s (0:41:46) ===============================================================================
```

Inside the pod:

```
root@ollama-0:/# ollama ps
NAME               ID              SIZE     PROCESSOR         UNTIL
llama3.3:latest    a6eb4748fd29    49 GB    93%/7% CPU/GPU    2 minutes from now
root@ollama-0:/# nvidia-smi
Sat Apr  5 07:47:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000 Laptop GPU    Off |   00000000:01:00.0  On |                  N/A |
| N/A   62C    P0             19W /   60W |    2548MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```
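
The `ollama ps` line is the key signal here: the model is split 93%/7% between CPU and GPU. The same placement data can be pulled programmatically; a minimal sketch, assuming the server's documented /api/ps endpoint and a hypothetical in-cluster hostname:

```
# Query the running Ollama server for loaded models and their CPU/GPU placement.
# Assumes /api/ps returns {"models": [{"name", "size", "size_vram", ...}]} as in
# the Ollama REST API docs; the hostname below is a hypothetical in-cluster address.
import requests

resp = requests.get("http://ollama-0:11434/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    size = m["size"]              # total bytes resident
    vram = m.get("size_vram", 0)  # bytes resident on the GPU
    gpu_pct = 100 * vram / size if size else 0
    print(f"{m['name']}: {size / 1e9:.0f} GB total, {gpu_pct:.0f}% on GPU")
```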

Config:

```
  OLLAMA_MODELS: "/models"
  OLLAMA_SCHED_SPREAD: "true"
  OLLAMA_CONTEXT_LENGTH: "8192"
  OLLAMA_HOST: "http://0.0.0.0:11434"
  OLLAMA_DEBUG: "true"
```
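
For reference, a LangChain test client pointed at a server with this config would look roughly like the sketch below; the package name, base_url, and num_ctx parameter are assumptions to be checked against the installed langchain-ollama version:

```
# Minimal sketch of a LangChain client wired to the server configured above.
# Assumes the langchain-ollama package; num_ctx mirrors OLLAMA_CONTEXT_LENGTH=8192,
# and the base_url is a hypothetical in-cluster address for the pod.
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.3",
    base_url="http://ollama-0:11434",
    num_ctx=8192,
)

response = llm.invoke("Summarize the retrieved context in one sentence.")
print(response.response_metadata)  # contains the eval_count / eval_duration fields shown above
```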

@ghmer commented on GitHub (Apr 5, 2025):

If I am not mistaken, you are running a 70b model on a Laptop GPU - what is your expectation?


@khteh commented on GitHub (Apr 5, 2025):

```
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  Model name:             11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            68Gi        24Gi       1.4Gi       5.0Gi        47Gi        43Gi
Swap:          2.0Gi       2.0Gi       1.1Mi
```

What's the minimum spec for acceptable performance?


@ghmer commented on GitHub (Apr 5, 2025):

That depends on your definition of "acceptable performance", I'd say. If your CPU is doing most of the work (which it is in your case) and you are not happy with the performance, there is little you can do besides getting a beefy graphics card with a lot of VRAM.

Then, check out something like this: [LLM RAM calculator](https://llm-calc.rayfernando.ai/)
Put in your VRAM and check how big of a model your setup can support.

Alternatively, you can try a much, much smaller model. Results won't be great, but at least you might get some performance out of it while you are waiting for a beefy graphics card :-)
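
For a rough back-of-envelope version of that check, using only the numbers already reported in this issue (49 GB model footprint from ollama ps, 4096 MiB of VRAM from nvidia-smi):

```
# Back-of-envelope check of how much of the model can live on the GPU,
# using the figures already reported in this issue.
model_size_gb = 49.0    # llama3.3:latest footprint from `ollama ps`
total_vram_gb = 4.0     # RTX A2000 Laptop GPU (4096 MiB per `nvidia-smi`)
vram_in_use_gb = 2.5    # ~2548 MiB already used on the host

upper_bound = total_vram_gb / model_size_gb
realistic = (total_vram_gb - vram_in_use_gb) / model_size_gb

print(f"GPU share if all VRAM were free: {upper_bound:.0%}")  # ~8%
print(f"GPU share with current usage:    {realistic:.0%}")    # ~3%
# Either way the vast majority of the 70B model spills to system RAM,
# which matches the 93%/7% CPU/GPU split from `ollama ps`.
```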


@khteh commented on GitHub (Apr 5, 2025):

How do I find out how much VRAM I have? The URL you shared seems to refer to the physical DDR RAM of the system?


@ghmer commented on GitHub (Apr 5, 2025):

> 2548MiB / 4096MiB

If I understand the output of your nvidia-smi command correctly, you have 4 GB of VRAM in total, and 2.6 GB is already taken (I guess by some apps on the host OS?!). So, not so much.
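
Total and used VRAM can also be read without eyeballing the table, via nvidia-smi's CSV query mode; a minimal sketch wrapping it from Python:

```
# Read total/used/free VRAM via nvidia-smi's query mode instead of the table output.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=name,memory.total,memory.used,memory.free",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    name, total, used, free = [f.strip() for f in line.split(",")]
    print(f"{name}: {free} MiB free of {total} MiB ({used} MiB in use)")
```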


@khteh commented on GitHub (Apr 5, 2025):

$$$$$

Reference: github-starred/ollama#6651