[GH-ISSUE #10622] Redundant and broken responses when using OLLAMA_SCHED_SPREAD #6988

Closed
opened 2026-04-12 18:52:59 -05:00 by GiteaMirror · 3 comments

Originally created by @sempervictus on GitHub (May 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10622

What is the issue?

Using OLLAMA_SCHED_SPREAD to run workloads concurrently clearly shows parallelism across 4 V100 SXM2 cards, but they all appear to be doing their own thing and jumbling the responses together into some Dr. Moreau concoction.

What's the right way to invoke ollama to utilize multiple GPUs without redoing the same work and producing fractured responses?

Relevant log output

version: '3.9'
services:
  open-webui:
    image: your_open_webui_image
    ports:
      - "8080:80"
    volumes:
      - ./data:/data
    depends_on:
      - postgres

  metrics:
    image: your_metrics_image
    ports:
      - "8080:80"
    - image: "3.9"
    open-webui:
      image: your_webui_image
    ports:
    - "808080:80"
    depends_on:
    - ./data:/data
      postgres:
        - ENABLE_RAG_WEB_SEARCH=true
        - WEB_SEARCH_ENGINE=jina
        - JINA_API_KEY=YOUR_API_KEY
        - JINA_API_KEY=YOUR_JINA_API_KEY
        - RAG_WEB_SEARCH_COUNT=3
        - WEB_SEARCH_CONCURRENT_REQUESTS=2

  postgress:
    image: postgres
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=your_postgres_password
    environment:
      - ENABLE_RAG_WEB_SEARCH=true
2. Get your API key from DuckGo
3. Open WebUI Admin panel.
4. Navigate to `Settings` tab, then click `Web Search`.
5. Enable `Web Search` and set to `bing`.
6. Click `Save` button.

![Enable Web Search](images/enable_web_search.png)


version: '3.9'

services:
  open-webui:
    image: your_webui_image
    ports:
      - "8080:80
    depends_on:
      - postgres

  postgres:

overlapping outputs example

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.6.7

GiteaMirror added the bug and needs more info labels 2026-04-12 18:52:59 -05:00

@rick-github commented on GitHub (May 8, 2025):

Setting `OLLAMA_SCHED_SPREAD` is usually not required because ollama will schedule layers across GPUs if required. Nonetheless, it shouldn't impact generation. [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
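
Alongside the server logs, it may help to take the web UI out of the picture and send a few distinct prompts straight to the Ollama HTTP API, keeping each response paired with the prompt that produced it. The sketch below is only an illustration, not something taken from this thread: it assumes a server listening on the default local address `http://localhost:11434`, uses the placeholder model name `llama3`, and introduces the hypothetical helpers `generate` and `run_concurrently`. It relies on the `/api/generate` endpoint with `"stream": false`, which returns a single JSON object whose `response` field holds the generated text.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: default local endpoint and a placeholder model name.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"


def generate(prompt, url=OLLAMA_URL, model=MODEL, opener=urllib.request.urlopen):
    """Send one non-streaming generate request and return the response text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    # `opener` is injectable so the function can be exercised without a server.
    with opener(req) as resp:
        return json.loads(resp.read())["response"]


def run_concurrently(prompts, send=generate, workers=4):
    """Issue the prompts in parallel and return (prompt, response) pairs in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(zip(prompts, pool.map(send, prompts)))
```

Calling `run_concurrently(["Name three primes.", "Summarize TCP in one line."])` against a running server returns one pair per prompt. If each response is coherent here but jumbled in the UI, that would point at the client or orchestration layer rather than at GPU scheduling; if the responses are already broken here, the server logs for the same requests are the next thing to attach.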

@galoisgroupcn commented on GitHub (May 10, 2025):

The behavior you're seeing is expected:
OLLAMA_SCHED_SPREAD spreads separate requests across GPUs for throughput, but does not parallelize a single prompt/completion across multiple GPUs.

How to use multiple GPUs with Ollama:

  1. Send multiple independent prompts—Ollama will assign each to a different GPU.
  2. Do not send the same prompt to multiple endpoints or in parallel; this causes duplicated and jumbled outputs.
  3. Each completion runs on a single GPU. Ollama does not split one model or prompt across GPUs.

If you want to parallelize a single large inference job (true model parallelism):

Ollama does not support this. Consider other frameworks like DeepSpeed, Hugging Face Accelerate, or vLLM if you need this capability.

Bottom line:

OLLAMA_SCHED_SPREAD = best for serving many concurrent, independent requests.
For a single prompt, only one GPU will be used.

Check your orchestrator/UI:

Make sure it's not dispatching the same prompt to multiple backends at once.


@rick-github commented on GitHub (May 10, 2025):

@galoisgroupcn This is mostly wrong. Please do not spam threads with LLM output, particularly if it's not correct.

Reference: github-starred/ollama#6988