[GH-ISSUE #5024] Multiple GPU HI00 #65218

Closed
opened 2026-05-03 20:03:00 -05:00 by GiteaMirror · 19 comments
Owner

Originally created by @sksdev27 on GitHub (Jun 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5024

Originally assigned to: @dhiltgen on GitHub.

### What is the issue?

I have multiple NVIDIA H100s with NVLink, but Ollama seems to only use one NVIDIA GPU. I tried various deployments; here is the current one:

nvidia-smi

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:17:00.0 Off | 0 |
| N/A 35C P0 49W / 310W | 7MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 PCIe Off | 00000000:65:00.0 Off | 0 |
| N/A 33C P0 47W / 310W | 7MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 PCIe Off | 00000000:CA:00.0 Off | 0 |
| N/A 32C P0 47W / 310W | 7MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 PCIe Off | 00000000:E3:00.0 Off | 0 |
| N/A 33C P0 49W / 310W | 7MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
```

Dockerfile

```dockerfile
# NVIDIA CUDA 12.2
FROM nvcr.io/nvidia/ai-workbench/python-cuda122:1.0.3

# Set up environment variables
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ENV CUDA_VISIBLE_DEVICES=0,1,2,3
ENV OLLAMA_CONFIG_PATH=/opt/ollama/ollama.yaml

# Install dependencies
RUN apt-get update && apt-get install -y wget && rm -rf /var/lib/apt/lists/*

# Install Ollama
RUN wget https://ollama.com/install.sh -O - | bash

# Copy the configuration file to the expected location
COPY ollama.yaml /opt/ollama/ollama.yaml

# Set working directory
WORKDIR /opt/ollama

# Expose port for Ollama
EXPOSE 5000

# Default command to start Ollama
CMD ["ollama", "start"]
```

```yaml
version: '3.8'

services:
  ollama:
    image: ollama-cuda122
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1,2,3
    ports:
      - "5000:5000"
    volumes:
      - ./models:/opt/ollama/models # Mount the models directory
    restart: unless-stopped
```

### OS

Linux

### GPU

Nvidia

### CPU

Other

### Ollama version

0.1.43

GiteaMirror added the bug label 2026-05-03 20:03:00 -05:00
Author
Owner

@dhiltgen commented on GitHub (Jun 18, 2024):

Can you share your server log?

My suspicion is that we do see all the GPUs, but you are loading a model that fits in one GPU's VRAM, so we're only loading it on one. If you attempt to load a large model, it will spread; or, on newer versions, you can set OLLAMA_SCHED_SPREAD to force it to spread over multiple GPUs.
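For reference, a minimal sketch (in Python purely for illustration) of building the server environment with spreading forced; the `ollama serve` launch itself is commented out since it needs a local Ollama install, and OLLAMA_SCHED_SPREAD is only honored on newer versions:

```python
import os

# Sketch: environment for an Ollama server that should spread a single
# model across all visible GPUs (OLLAMA_SCHED_SPREAD is honored in 0.1.45+).
env = dict(
    os.environ,
    OLLAMA_SCHED_SPREAD="1",
    CUDA_VISIBLE_DEVICES="0,1,2,3",  # the four H100s from the report above
)

# Launching the server requires an Ollama install, e.g.:
# import subprocess
# subprocess.run(["ollama", "serve"], env=env)

print(env["OLLAMA_SCHED_SPREAD"])
```

The same effect is achieved in Docker by passing `-e OLLAMA_SCHED_SPREAD=1` to `docker run`, as shown later in this thread.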

@sksdev27 commented on GitHub (Jun 19, 2024):

This time, when I tried the 70B, it couldn't load on one GPU, so it failed.
Here are the Docker logs:

[docker_logs_ollama.log](https://github.com/user-attachments/files/15895114/docker_logs_ollama.log)

I also tried setting OLLAMA_SCHED_SPREAD: `docker run --gpus all -p 11434:11434 -e OLLAMA_SCHED_SPREAD=1 -it --rm ollama/ollama:latest`
Here are its logs:
[OLLAMA_SCHED_SPREAD.log](https://github.com/user-attachments/files/15895227/OLLAMA_SCHED_SPREAD.log)

@dhiltgen commented on GitHub (Jun 19, 2024):

From the looks of the first log, your client gave up after ~2 minutes and we aborted the load as a result:

```
time=2024-06-19T03:33:26.815Z level=WARN source=server.go:536 msg="client connection closed before server finished loading, aborting load"
```

You may see better load performance by disabling mmap:

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false, "options": {"num_gpu": 21 }
}'
```

I forgot that OLLAMA_SCHED_SPREAD is new in 0.1.45 which explains why 0.1.44 didn't respect it.
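For anyone scripting this rather than using curl, the same request can be sent from Python's standard library. A minimal sketch, assuming the default server address (the `OLLAMA_URL` constant is an assumption; adjust it to your deployment):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address

def generate_payload(model: str, prompt: str, options: dict) -> bytes:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": options,
    }).encode("utf-8")

def generate(model: str, prompt: str, options: dict) -> dict:
    """Send a non-streaming generate request (needs a running server)."""
    req = urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=generate_payload(model, prompt, options),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Mirrors the curl example; not executed here since it needs a live server:
# print(generate("llama3", "Why is the sky blue?", {"num_gpu": 21})["response"])
```

Per-request options like `num_gpu` or `use_mmap` go in the `options` object, exactly as in the curl examples in this thread.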

@sksdev27 commented on GitHub (Jun 20, 2024):

[ollama_45_rc3.log](https://github.com/user-attachments/files/15919215/ollama_45_rc3.log)
[ollama_45_rc2.log](https://github.com/user-attachments/files/15919216/ollama_45_rc2.log)
[ollama_45_rc4_rom.log](https://github.com/user-attachments/files/15919217/ollama_45_rc4_rom.log)

Not sure what crashed the NVIDIA GPU, but after running Ollama, the GPU crashes.

I wanted to load this with Open WebUI; not sure disabling mmap would be possible through Open WebUI.

@dhiltgen commented on GitHub (Jun 20, 2024):

In the next release (0.1.46) we'll have [automatic mmap logic](https://github.com/ollama/ollama/pull/5194), so if the model is larger than the free memory on the system, we'll revert to regular file reads instead of mmap. From your logs, though, it looks like this system has a lot of memory, so we'd still default to mmap for the model you're trying to load.

You didn't mention what model you're trying to load, however I see the load timed out before the cuda error happened, so it's possible this was a race of trying to shutdown while it was still loading. I'd suggest trying to load this model with mmap disabled using curl (see above) and see if that at least gets it to load, or if there's still some other bug lurking in here.

If switching to regular file reads solves the problem, then I may be able to adjust the algorithm to set some upper threshold where we disable mmap for extremely large models, but I don't want to do that until we can confirm it actually solves the problem.
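The decision being described can be illustrated with a toy heuristic. To be clear, this is not Ollama's actual code (the real logic is in the PR linked above); the hard cap is a hypothetical value standing in for the "upper threshold" idea:

```python
def should_use_mmap(model_bytes: int, free_ram_bytes: int,
                    hard_cap_bytes: int = 64 * 2**30) -> bool:
    """Toy version of the mmap decision discussed in this thread.

    - If the model doesn't fit in free RAM, mmap tends to thrash the
      page cache, so fall back to regular file reads.
    - hard_cap_bytes models the hypothetical upper threshold mentioned
      above for disabling mmap on extremely large models.
    """
    if model_bytes > free_ram_bytes:
        return False  # automatic fallback landing in 0.1.46
    if model_bytes > hard_cap_bytes:
        return False  # speculative cap for very large models
    return True

# A 39 GB llama3:70b blob on a box with 200 GB of free RAM would still
# default to mmap without some such cap:
print(should_use_mmap(39 * 2**30, 200 * 2**30, hard_cap_bytes=float("inf")))
```

Under the sketch, the reporter's system (lots of free RAM) keeps mmap on unless a size cap like `hard_cap_bytes` is introduced, which matches the behavior described above.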

@sksdev27 commented on GitHub (Jun 21, 2024):

I tested 0.1.45-rc4 with the curl command. Here are the logs:

[ollama_45_rc4_mmap_dis.log](https://github.com/user-attachments/files/15929587/ollama_45_rc4_mmap_dis.log)

NVIDIA SMI during exit:

```
Fri Jun 21 09:24:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:17:00.0 Off | 0 |
| N/A 47C P0 86W / 310W | 4469MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 PCIe Off | 00000000:65:00.0 Off | 0 |
| N/A 48C P0 82W / 310W | 3151MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 PCIe Off | 00000000:CA:00.0 Off | 0 |
| N/A 45C P0 80W / 310W | 3151MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 PCIe Off | 00000000:E3:00.0 Off | 0 |
| N/A 45C P0 84W / 310W | 3151MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 44047 C ...unners/cuda_v11/ollama_llama_server 4456MiB |
| 1 N/A N/A 44047 C ...unners/cuda_v11/ollama_llama_server 3138MiB |
| 2 N/A N/A 44047 C ...unners/cuda_v11/ollama_llama_server 3138MiB |
| 3 N/A N/A 44047 C ...unners/cuda_v11/ollama_llama_server 3138MiB |
+---------------------------------------------------------------------------------------+
```

@dhiltgen commented on GitHub (Jun 21, 2024):

Hmm... those logs don't seem to indicate `use_mmap=false` was passed. It's still using the mmap logic to load the model.

The subprocess was started with the following:

```
time=2024-06-21T15:20:56.247Z level=INFO source=server.go:359 msg="starting llama server" cmd="/tmp/ollama3491629577/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-0bd51f8f0c975ce910ed067dcb962a9af05b77bafcdc595ef02178387f10e51d --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 21 --verbose --parallel 1 --tensor-split 6,5,5,5 --tensor-split 6,5,5,5 --port 44779"
```

There should be an additional `--no-mmap` flag passed in there if `use_mmap=false` was passed in.
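As an aside, the `--tensor-split 6,5,5,5` in that command line is how the 21 offloaded layers get apportioned across the four GPUs. A rough sketch of how such a split can be derived, dividing layers proportionally to free VRAM per device (illustrative only, not Ollama's actual scheduler):

```python
def layer_split(num_layers: int, free_vram: list[int]) -> list[int]:
    """Assign layers to GPUs proportionally to each GPU's free VRAM."""
    total = sum(free_vram)
    # Proportional share, rounded down...
    split = [num_layers * v // total for v in free_vram]
    # ...then hand any leftover layers to the GPUs with the most free VRAM.
    leftover = num_layers - sum(split)
    order = sorted(range(len(free_vram)), key=lambda i: -free_vram[i])
    for i in order[:leftover]:
        split[i] += 1
    return split

# 21 layers over four GPUs, the first with slightly more free VRAM:
print(layer_split(21, [6, 5, 5, 5]))  # → [6, 5, 5, 5]
```

With VRAM in the ratio 6:5:5:5, the sketch reproduces the `6,5,5,5` split seen in the log above.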

@sksdev27 commented on GitHub (Jun 21, 2024):

Hmm, I used the curl command:

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b",
  "prompt": "Why is the sky blue?",
  "stream": false, "options": {"num_gpu": 21 }
}'
```

Is there another way to pass the `--no-mmap` argument?

@dhiltgen commented on GitHub (Jun 21, 2024):

Oops, sorry, I cut-and-pasted the wrong curl example. Try this:

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b",
  "prompt": "Why is the sky blue?",
  "stream": false, "options": {"use_mmap": false }
}'
```
@sksdev27 commented on GitHub (Jun 21, 2024):

It went a little bit further but still resulted in a server crash; then I relaunched the same thing and it was up and running. Here are the logs:
[ollama_45_rc4_mmap_dis_latest.log](https://github.com/user-attachments/files/15932078/ollama_45_rc4_mmap_dis_latest.log)

@dhiltgen commented on GitHub (Jun 21, 2024):

The latest log seems somewhat truncated, so I can't see the loading portion, but good to hear you got it working by adding `use_mmap=false` - I'm curious how long the load took.

I'm not sure what the threshold should be to toggle off mmap. I'll try to run some more experiments to see if I can find what the deciding factor(s) should be, but if you have the ability to experiment with different sized models and single vs. multi-GPU in this same environment, that might help us understand when we should switch loading strategy.

@sksdev27 commented on GitHub (Jun 21, 2024):

So it crashed after a while; here are the latest logs:
[log.txt](https://github.com/user-attachments/files/15932944/log.txt)

I will try to load it again and get back to you with the loading logs, and also try different-sized models. I have a single-H100 PC as well and can test single GPU vs. multi GPU.

@sksdev27 commented on GitHub (Jun 22, 2024):

So I relaunched it: the first time it failed, the second time it failed, the third time it failed, and the fourth time it started working. Then I ran two curl commands similar to this:

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b",
  "prompt": "Why is the sky blue?",
  "stream": false, "options": {"use_mmap": false }
}'
```

Then I started Open WebUI and sent it a couple of questions. After two questions the GPU crashed. I think it's because I am not passing `"use_mmap": false` through the web UI.

[ollama_1_45_mmap_dis_all.log](https://github.com/user-attachments/files/15934808/ollama_1_45_mmap_dis_all.log)

I will do more testing and try out small models

@sksdev27 commented on GitHub (Jun 22, 2024):

[ollama_46.log](https://github.com/user-attachments/files/15937507/ollama_46.log)
I tried 0.1.46 and it works, but if I leave the GPU idle for a while, it breaks. Don't know why; maybe it's the GPU or something else. Trying to figure that out, but loading seems to be working fine. I will try other models.

@sksdev27 commented on GitHub (Jun 25, 2024):

So the GPU crash was because my NVIDIA drivers weren't updated: they were supposed to be 535.183 instead of 535.163.

Bottom line: I assume this ticket can be closed. However, here is one issue that I do want to highlight. During the launch of llama3:70b with 0.1.46, unless I run this first:

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b",
  "prompt": "Why is the sky blue?",
  "stream": false, "options": {"use_mmap": false }
}'
```

if I did `ollama pull llama3:70b` and followed up with the run command, it would fail to load the server.

Also, as for your other questions, here is the current setup with a 1-GPU server and a 4-GPU server.

1 GPU, running the following models with Ollama 0.1.46:

```
root@4cdbe351ed8b:/# ollama list
NAME                   ID            SIZE    MODIFIED
mistral:latest         2ae6f6dd7a3d  4.1 GB  About a minute ago
starcoder2:7b          0679cedc1189  4.0 GB  About a minute ago
gemma:7b               a72c7f4d0a15  5.0 GB  About a minute ago
llama3:latest          365c0bd3c000  4.7 GB  About a minute ago
command-r:latest       b8cdfff0263c  20 GB   About a minute ago
```

4 GPUs, running the following models with Ollama 0.1.46:

```
root@c1e628e9c647:/# ollama list
NAME                   ID            SIZE    MODIFIED
starcoder2:15b         20cdb0f709c2  9.1 GB  33 seconds ago
mistral:latest         2ae6f6dd7a3d  4.1 GB  34 seconds ago
command-r-plus:latest  c9c6cc6d20c7  59 GB   35 seconds ago
llama3:70b             786f3184aec0  39 GB   34 seconds ago
openchat:latest        537a4e03b649  4.1 GB  About a minute ago
```

Testing with an Open WebUI client.

@sksdev27 commented on GitHub (Jun 25, 2024):

Here is a comparison of loading Ollama version 0.1.46, launched using the following docker command:

```
docker run --gpus all -p 11434:11434 -e OLLAMA_SCHED_SPREAD=true -e OLLAMA_DEBUG=true -it --rm ollama/ollama:0.1.46
```

Loading logs for `ollama run llama3:70b`:
[ollama_1_46_ollama_run_llama70b.log](https://github.com/user-attachments/files/15976212/ollama_1_46_ollama_run_llama70b.log)
nvidia-smi log:
[log-nvidia-smi.log](https://github.com/user-attachments/files/15976234/log-nvidia-smi.log)

Loading logs for the curl command:

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:70b",
  "prompt": "Why is the sky blue?",
  "stream": false, "options": {"use_mmap": false }
}'
```
The command had to be run twice:

```
root@f03fa0d6d2bd:/# curl http://localhost:11434/api/generate -d '{
"model": "llama3:70b",
"prompt": "Why is the sky blue?",
"stream": false, "options": {"use_mmap": false }
}'
{"error":"timed out waiting for llama runner to start - progress 1.00 - "}root@f03fa0d6d2bd:/# curl http://localhost:11434/api/generate -d '{
"model": "llama3:70b",
"prompt": "Why is the sky blue?",
"stream": false, "options": {"use_mmap": false }
}'
{"model":"llama3:70b","created_at":"2024-06-25T18:55:09.266770179Z","response":"One of the most popular and intriguing questions in all of science!\n\nThe sky appears blue because of a phenomenon called Rayleigh scattering, which is named after the British physicist Lord Rayleigh. In 1871, he discovered that shorter (blue) wavelengths of light are scattered more than longer (red) wavelengths by the tiny molecules of gases in the atmosphere.\n\nHere's what happens:\n\n1. Sunlight enters Earth's atmosphere: When sunlight enters our atmosphere, it encounters tiny molecules of gases like nitrogen (N2) and oxygen (O2). These molecules are much smaller than the wavelength of light.\n2. Scattering occurs: The shorter wavelengths of light, such as blue and violet, are more easily deflected by these small molecules due to their smaller size. This is known as Rayleigh scattering.\n3. Blue light is scattered in all directions: As a result of this scattering, the blue light is dispersed throughout the atmosphere, reaching our eyes from all directions.\n4. Red light continues its path: The longer wavelengths of light, like red and orange, are less affected by the small molecules and continue to travel in a more direct path to our eyes.\n\nThis combination of scattered blue light and direct red light creates the blue color we see in the sky during the daytime. The exact shade of blue can vary depending on atmospheric conditions, such as pollution, dust, and water vapor, which can scatter light in different ways.\n\nAdditionally, the following factors can influence the apparent color of the sky:\n\n* Time of day: During sunrise and sunset, the sun's rays have to travel through more of the atmosphere, scattering shorter wavelengths and making the sky appear more red or orange.\n* Atmospheric conditions: Dust, pollution, and water vapor can scatter light in different ways, changing the apparent color of the sky.\n* Altitude and atmospheric pressure: At higher elevations, there is less air to scatter the light, resulting in a deeper blue color.\n\nSo, to summarize, the sky appears blue because of the scattering of shorter (blue) wavelengths of light by the tiny molecules in our atmosphere, while longer (red) wavelengths continue their path directly to our eyes.","done":true,"done_reason":"stop","context":[128006,882,128007,271,10445,374,279,13180,6437,30,128009,128006,78191,128007,271,4054,315,279,1455,5526,323,41765,4860,304,682,315,8198,2268,791,13180,8111,6437,1606,315,264,25885,2663,13558,64069,72916,11,902,374,7086,1306,279,8013,83323,10425,13558,64069,13,763,220,9674,16,11,568,11352,430,24210,320,12481,8,93959,315,3177,527,38067,810,1109,5129,320,1171,8,93959,555,279,13987,35715,315,45612,304,279,16975,382,8586,596,1148,8741,1473,16,13,3146,31192,4238,29933,9420,596,16975,96618,3277,40120,29933,1057,16975,11,433,35006,13987,35715,315,45612,1093,47503,320,45,17,8,323,24463,320,46,17,570,4314,35715,527,1790,9333,1109,279,46406,315,3177,627,17,13,3146,3407,31436,13980,96618,578,24210,93959,315,3177,11,1778,439,6437,323,80836,11,527,810,6847,711,2258,555,1521,2678,35715,4245,311,872,9333,1404,13,1115,374,3967,439,13558,64069,72916,627,18,13,3146,10544,3177,374,38067,304,682,18445,96618,1666,264,1121,315,420,72916,11,279,6437,3177,374,77810,6957,279,16975,11,19261,1057,6548,505,682,18445,627,19,13,3146,6161,3177,9731,1202,1853,96618,578,5129,93959,315,3177,11,1093,2579,323,19087,11,527,2753,11754,555,279,2678,35715,323,3136,311,5944,304,264,810,2167,1853,311,1057,6548,382,2028,10824,315,38067,6437,3177,323,2167,2579,3177,11705,279,6437,1933,584,1518,304,279,13180,2391,279,62182,13,578,4839,28601,315,6437,649,13592,11911,389,45475,4787,11,1778,439,25793,11,16174,11,323,3090,38752,11,902,649,45577,3177,304,2204,5627,382,50674,11,279,2768,9547,649,10383,279,10186,1933,315,279,13180,1473,9,3146,1489,315,1938,96618,12220,64919,323,44084,11,279,7160,596,45220,617,311,5944,1555,810,315,279,16975,11,72916,24210,93959,323,3339,279,13180,5101,810,2579,477,19087,627,9,3146,1688,8801,33349,4787,96618,33093,11,25793,11,323,3090,38752,649,45577,3177,304,2204,5627,11,10223,279,10186,1933,315,279,13180,627,9,3146,27108,3993,323,45475,7410,96618,2468,5190,12231,811,11,1070,374,2753,3805,311,45577,279,3177,11,13239,304,264,19662,6437,1933,382,4516,11,311,63179,11,279,13180,8111,6437,1606,315,279,72916,315,24210,320,12481,8,93959,315,3177,555,279,13987,35715,304,1057,16975,11,1418,5129,320,1171,8,93959,3136,872,1853,6089,311,1057,6548,13,128009],"total_duration":55703871735,"load_duration":39062954205,"prompt_eval_count":16,"prompt_eval_duration":90332000,"eval_count":443,"eval_duration":16548221000}root@f03fa0d6d2bd:/
```
logs:
Load logs:
[ollama_1_46_ollama_run_llama70b_curl.log](https://github.com/user-attachments/files/15976479/ollama_1_46_ollama_run_llama70b_curl.log)
nvidia-smi logs:
[log-nvidia-smi_curl.log](https://github.com/user-attachments/files/15976486/log-nvidia-smi_curl.log)
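For reference, the duration fields in the `/api/generate` JSON response above are reported in nanoseconds, which makes the raw figures hard to read at a glance. A minimal sketch that converts the numbers from the successful curl attempt above into human-readable stats (the `summarize` helper is illustrative, not part of ollama):

```python
# Convert the nanosecond duration fields of an ollama /api/generate
# response (as in the log above) into human-readable statistics.

NS_PER_S = 1_000_000_000

def summarize(resp: dict) -> dict:
    """Return load time, total time, and eval rate from a response dict."""
    return {
        "load_s": resp["load_duration"] / NS_PER_S,
        "total_s": resp["total_duration"] / NS_PER_S,
        "tokens_per_s": resp["eval_count"] * NS_PER_S / resp["eval_duration"],
    }

# Figures taken from the second (successful) curl attempt above.
resp = {
    "total_duration": 55703871735,
    "load_duration": 39062954205,
    "eval_count": 443,
    "eval_duration": 16548221000,
}

stats = summarize(resp)
print(f"load: {stats['load_s']:.1f}s, total: {stats['total_s']:.1f}s, "
      f"eval: {stats['tokens_per_s']:.1f} tok/s")
# → load: 39.1s, total: 55.7s, eval: 26.8 tok/s
```

Here `load_duration` alone accounts for roughly 39 of the 55.7 total seconds, i.e. model loading dominates the request time.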

@dhiltgen commented on GitHub (Jul 5, 2024):

That's great that you have a working setup.

Looking at that last log, even without mmap, we're still taking a really long time to initialize on your 4-GPU setup. It looks like the loading progress hit 100% in ~14 seconds, but the runner was still initializing for over 5 minutes and triggered our timeout. On the second attempt the caches were warm, and it only took 36s to load overall.

While we could increase the timeout, taking more than 5 minutes to fully load the model still feels problematic. I'm working on another change to add CUDA v12 support, with the intent of improving performance on more modern GPUs, which might wind up solving this load lag. #5049
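To check whether layers are actually being spread across all four H100s during a slow load, one option is to sample per-GPU memory usage. A minimal sketch, assuming the standard `nvidia-smi --query-gpu` CSV interface (the `parse_gpu_memory` helper and the sample reading are illustrative, not part of ollama):

```python
# Verify multi-GPU spread by sampling per-GPU memory from nvidia-smi.
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text: str) -> dict:
    """Parse nvidia-smi CSV output into {gpu_index: MiB used}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, mib = line.split(",")
        usage[int(index)] = int(mib)
    return usage

def snapshot() -> dict:
    """Take one live reading (requires the NVIDIA driver to be present)."""
    return parse_gpu_memory(subprocess.check_output(QUERY, text=True))

# Hypothetical reading taken while a 70B model is loading: with
# OLLAMA_SCHED_SPREAD=true we would expect all four GPUs to fill up.
sample = "0, 20480\n1, 20480\n2, 20480\n3, 20480"
busy = [i for i, mib in parse_gpu_memory(sample).items() if mib > 1000]
print(f"GPUs holding model weights: {busy}")
# → GPUs holding model weights: [0, 1, 2, 3]
```

If only one index shows up while the others stay at idle levels (the ~7 MiB seen in the nvidia-smi output above), the scheduler is not spreading the model.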

@sksdev27 commented on GitHub (Jul 5, 2024):

So it would eventually start up, but then fail after a day or two. It seems there was an issue with one of the GPUs. From my research, the model I am working with also has NVLink installed, so it should have been treated as one GPU. We are currently working on replacing it, either the GPU or one of the components around it.

@sksdev27 commented on GitHub (Jul 25, 2024):

@dhiltgen So when I launch the latest ollama 0.2.8 it uses one GPU, but when I use ollama version 0.1.30 it uses all the GPUs. The fix that you applied here didn't make it to 0.2.8.

Reference: github-starred/ollama#65218