[GH-ISSUE #7827] I hope that ollama can optimize the parallel performance of CPU computations? #5009

Closed
opened 2026-04-12 16:04:51 -05:00 by GiteaMirror · 1 comment

Originally created by @CarsonJiang on GitHub (Nov 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7827

When Ollama serves as the LLM provider for GraphRAG, a single CPU core runs at 100% usage, and this appears to prevent full utilization of the multiple GPUs on the server.

Screenshot (ollama_graphrag_low_util): https://github.com/user-attachments/assets/2bbb1b65-e63a-480c-ad5e-ee17fc4723a1

settings.yaml

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ollama # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: llama3.2
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  api_base: http://localhost:11434/v1
  request_timeout: 1800.0
  max_tokens: 8000

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ollama
    type: openai_embedding # or azure_openai_embedding
    model: nomic-embed-text:latest
    api_base: http://localhost:11434/v1
    request_timeout: 1800.0
GiteaMirror added the feature request label 2026-04-12 16:04:51 -05:00

@rick-github commented on GitHub (Nov 25, 2024):

https://github.com/ollama/ollama/issues/6913

Multi-threading will not improve performance. The CPU is at 100% because CPU/GPU synchronization is a spin-wait. The two processors share a small region of memory and use it to tell each other what the current state is. The CPU is busy checking "are you ready yet" a million times per second; once in a while the GPU says "yes, I am finished", the CPU spends a few cycles sending a new command to the GPU, and then it goes back to spinning on "are you ready yet".
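
To make the spin-wait point concrete, here is a small Python illustration. It is not Ollama's or the GPU runtime's actual synchronization code: the "GPU" is just a thread that sleeps, and `simulated_gpu`, `wait_spinning`, `wait_blocking`, and `measure` are hypothetical names made up for this sketch. It only shows that a waiter which polls a flag burns CPU time for the whole wait, while a blocking waiter uses almost none, even though both finish at the same moment.

```python
import threading
import time


def simulated_gpu(done: threading.Event, work_seconds: float) -> None:
    """Stand-in for a GPU kernel: 'computes' by sleeping, then signals completion."""
    time.sleep(work_seconds)
    done.set()


def wait_spinning(done: threading.Event) -> int:
    """Busy-poll the flag ("are you ready yet?"); return how often it was checked."""
    checks = 0
    while not done.is_set():
        checks += 1
    return checks


def wait_blocking(done: threading.Event) -> int:
    """Sleep until signalled; the flag is effectively checked once."""
    done.wait()
    return 1


def measure(wait_fn, work_seconds: float = 0.2):
    """Run one simulated kernel and return (flag checks, process CPU seconds used)."""
    done = threading.Event()
    gpu = threading.Thread(target=simulated_gpu, args=(done, work_seconds))
    cpu_before = time.process_time()  # CPU time of this process, sleeping excluded
    gpu.start()
    checks = wait_fn(done)
    gpu.join()
    return checks, time.process_time() - cpu_before


if __name__ == "__main__":
    for name, fn in (("spin-wait", wait_spinning), ("blocking", wait_blocking)):
        checks, cpu = measure(fn)
        print(f"{name}: {checks} checks, {cpu:.3f} s of CPU time")
```

In both cases the work takes the same wall-clock time; the spinning waiter simply shows up as a fully busy core while it waits, which is why adding more CPU threads would not make the GPU finish sooner.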

From your screenshot, the model you are using is llama3.2:3b-instruct-q4_K_M, and since all of the processes listed have the same PID, you are probably using `OLLAMA_SCHED_SPREAD` to force the model to run on all GPUs. This is not an efficient use of your GPUs; see https://github.com/ollama/ollama/issues/7648 for more discussion. Your GPU utilization is poor because each GPU is waiting for intermediate results from a different GPU. With a small model like llama3.2:3b, you will get better performance if you run multiple servers as outlined in the linked discussion.
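
As a rough sketch of what "multiple servers" could look like from the client side, the Python below spreads requests over several Ollama instances in round-robin order. Everything specific here is an assumption for illustration, not something taken from the linked discussion: the ports, the one-server-per-GPU layout, NVIDIA GPUs (hence `CUDA_VISIBLE_DEVICES`), and the helper names `server_env`, `RoundRobin`, `build_request`, and `generate`. The only Ollama pieces relied on are the `OLLAMA_HOST` environment variable and the `/api/generate` endpoint.

```python
import itertools
import json
import threading
import urllib.request

# Hypothetical layout: one Ollama server per GPU, each on its own port.
SERVERS = [f"http://localhost:{11434 + gpu}" for gpu in range(4)]


def server_env(gpu_index: int, port: int) -> dict:
    """Environment for one server pinned to one GPU (assumes NVIDIA GPUs)."""
    return {
        "CUDA_VISIBLE_DEVICES": str(gpu_index),  # expose only this GPU
        "OLLAMA_HOST": f"127.0.0.1:{port}",      # address this server binds to
    }


class RoundRobin:
    """Thread-safe round-robin choice among server base URLs."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))
        self._lock = threading.Lock()

    def pick(self) -> str:
        with self._lock:
            return next(self._cycle)


def build_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(picker: RoundRobin, model: str, prompt: str, timeout: float = 1800.0) -> str:
    """Send one prompt to the next server in rotation and return its text."""
    request = build_request(picker.pick(), model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["response"]


if __name__ == "__main__":
    picker = RoundRobin(SERVERS)
    print(generate(picker, "llama3.2", "Say hello in one word."))
```

Since the GraphRAG config above takes a single `api_base`, using a setup like this would in practice need something that presents the servers behind one address (a reverse proxy or load balancer, for example); the round-robin picker only illustrates the dispatch idea.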
Reference: github-starred/ollama#5009