[GH-ISSUE #1769] Long initial loading time. #63052

Closed
opened 2026-05-03 11:35:54 -05:00 by GiteaMirror · 15 comments

Originally created by @themw123 on GitHub (Jan 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1769

It takes a few minutes, and sometimes the model never starts at all when I try to run it. The problem occurs with all the models I am using, even small ones like tinyllama. Once the model has finally loaded after a few minutes, everything works fine and I get fast chat responses.

I am using Windows with WSL2 and Docker Desktop. Ollama is installed in WSL2, and the models are also placed there by bind-mounting them into the WSL2 file system via Docker volumes.

@pdevine commented on GitHub (Jan 4, 2024):

The long load time is because the model is being loaded into memory when you start the REPL. I'm guessing the problem is related to Docker Desktop's I/O speed. You can confirm this by timing a copy of a large file (even tinyllama is > 600 MB) inside the Docker volume.
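
For reference, a minimal way to get that throughput number without copying anything is to read an existing model blob back out with `dd`, which prints an MB/s figure when it finishes. This is only a sketch: the container name `ollama` is an assumption, and the blob file name must be replaced with one that actually exists in your volume:

```sh
# List the model blobs stored in the volume.
docker exec -it ollama ls -lh /root/.ollama/models/blobs

# Read one blob end-to-end; dd reports the effective read throughput when it finishes.
# (Repeat runs will be faster because of the page cache; the first run is the telling one.)
docker exec -it ollama dd if=/root/.ollama/models/blobs/sha256-<blob-id> of=/dev/null bs=1M
```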

@zioalex commented on GitHub (Mar 25, 2024):

I am in the same situation. Did you find any way to improve Docker I/O performance in this case?

@themw123 commented on GitHub (Mar 25, 2024):

No, the slow loading is due to WSL. I am now using Ollama on native Windows. The loading time has improved, but it is still not very fast; I would say it takes about half as long as before with WSL.

@zioalex commented on GitHub (Mar 26, 2024):

I see. Unfortunately I cannot install it natively on Windows. Still searching for a way to optimise WSL + Docker + Ollama.

@M0wLaue commented on GitHub (May 7, 2024):

Try this compose.yml with `docker compose up -d`:

```yaml
services:
  ollama:
    container_name: ollama
    image: ollama/ollama:latest
    volumes:
#      - ./ollama:/root/.ollama # this solution synchronizes with the real harddrive and is slow af
      - ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
    ports:
      - 11434:11434
    networks:
      - llm-network
    environment:
      - gpus=all
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

networks:
  llm-network:
    driver: bridge

volumes:
    ollama:
```
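
A quick way to check what this changes, and to reuse models that were already downloaded into a bind-mounted `./ollama` directory, is sketched below. The commands are assumptions based on Docker Desktop defaults; Compose usually prefixes named volumes with the project name, so substitute the actual volume name from `docker volume ls`:

```sh
# Find the actual volume name (Compose typically creates <project>_ollama).
docker volume ls

# Show where the named volume lives: inside Docker's own Linux filesystem,
# which is why reads from it are much faster than from a Windows-side bind mount.
docker volume inspect <project>_ollama --format '{{ .Mountpoint }}'

# One-off copy of previously downloaded models from a bind-mounted ./ollama
# directory into the named volume, using a throwaway container.
docker run --rm -v "$(pwd)/ollama:/from" -v <project>_ollama:/to alpine cp -a /from/. /to/
```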

@Chukarslan commented on GitHub (May 23, 2024):

Did someone find a solution for this? I am running Ollama on AWS EC2 (tried a range of g5 and g4 instances), and it seems it is always around 17 seconds for Mistral (3.7 GB), which is extremely slow. Screenshot from a g5.12xlarge instance with 4 GPUs:

![g5x12](https://github.com/ollama/ollama/assets/108550191/fa677004-51ac-4b8d-bd25-db47452cb491)

@Darshan2104 commented on GitHub (Jul 11, 2024):

I am using the codellama model on my local machine, and only the very first query takes longer than expected.
What could be the reason? Will it work fine after a few initial queries?

@LuisMalhadas commented on GitHub (Jul 15, 2024):

Here is another report:
Trying to load llama3:70b onto three RTX 3090s takes me around half an hour to an hour.
I am currently running it in Docker:

```
docker run -d --gpus '"device=0,2,3"' -v ollama:/root/.ollama -v .../ollama:/model -p 11434:11434 -e OLLAMA_HOST=0.0.0.0 -e OLLAMA_ORIGINS=* -e OLLAMA_MAX_LOADED_MODELS=2 -e OLLAMA_NUM_PARALLEL=2 -e OLLAMA_DEBUG=1 -e CUDA_ERROR_LEVEL=50 --name ollama2 ollama/ollama
```
```
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
```
```
Mon Jul 15 08:01:44 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off |   00000000:04:00.0 Off |                  Off |
| 35%   61C    P2            215W /  450W |   17597MiB /  24564MiB |     27%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:43:00.0 Off |                  N/A |
| 55%   66C    P2            213W /  350W |   17293MiB /  24576MiB |     31%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off |   00000000:88:00.0 Off |                  N/A |
| 78%   58C    P2            241W /  420W |   17293MiB /  24576MiB |     37%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off |   00000000:C4:00.0 Off |                  N/A |
|  0%   36C    P8             21W /  350W |   10499MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2236599      C   ...unners/cuda_v11/ollama_llama_server      17590MiB |
|    1   N/A  N/A   2236599      C   ...unners/cuda_v11/ollama_llama_server      17286MiB |
|    2   N/A  N/A   2236599      C   ...unners/cuda_v11/ollama_llama_server      17286MiB |
|    3   N/A  N/A     19625      C   /app/.venv/bin/python                       10492MiB |
+-----------------------------------------------------------------------------------------+
```
```
sudo dmesg | grep -i nvrm
[sudo] password for djfil: 
[    5.645486] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.90.07  Fri May 31 09:35:42 UTC 2024
[20922.713866] NVRM: GPU at PCI:0000:88:00: GPU-0a469c40-39b4-37e0-3229-5ff659d33432
[20922.713885] NVRM: Xid (PCI:0000:88:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[20922.713893] NVRM: GPU 0000:88:00.0: GPU has fallen off the bus.
[20922.713907] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
djfil@antonio:~$ sudo dmesg | grep -i nvidia
[    5.401223] nvidia: loading out-of-tree module taints kernel.
[    5.401286] nvidia: module license 'NVIDIA' taints kernel.
[    5.432111] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.453959] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[    5.457233] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.508309] nvidia 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.551675] nvidia 0000:88:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.599148] nvidia 0000:c4:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.645486] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.90.07  Fri May 31 09:35:42 UTC 2024
[    5.666118] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.90.07  Fri May 31 09:30:47 UTC 2024
[    5.670384] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[    6.798471] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 1
[    6.799105] [drm] [nvidia-drm] [GPU ID 0x00004300] Loading driver
[    7.768095] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:43:00.0 on minor 2
[    7.774306] [drm] [nvidia-drm] [GPU ID 0x00008800] Loading driver
[   10.529242] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:88:00.0 on minor 3
[   10.534219] [drm] [nvidia-drm] [GPU ID 0x0000c400] Loading driver
[   11.551526] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c4:00.0 on minor 4
[   25.182630] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[   25.188515] nvidia-uvm: Loaded the UVM driver, major device number 506.
[   29.637075] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input8
[   29.637208] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input9
[   29.637326] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input10
[   29.637466] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input11
[   29.637565] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input12
[   29.637666] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input13
[   29.637765] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input14
[   29.637928] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input15
[   29.640361] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input1
[   29.641554] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input16
[   29.649335] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input17
[   29.650121] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input2
[   29.654941] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input18
[   29.660677] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input19
[   29.667755] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input3
[   29.676380] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input20
[   29.692111] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input4
[   29.698743] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input5
[   29.699110] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input21
[   29.707294] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input6
[   29.715194] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input7
[   32.513871] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input22
[   32.514018] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input23
[   32.514146] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input24
[   32.514302] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input25
[   32.514441] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input26
[   32.514575] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input27
[   32.514720] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input28
[   34.566728] audit: type=1400 audit(1720623481.265:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1246 comm="apparmor_parser"
[   34.566737] audit: type=1400 audit(1720623481.265:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1246 comm="apparmor_parser"
[  181.782373] audit: type=1400 audit(1720623628.941:113): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=1858 comm="apparmor_parser"
[  181.782381] audit: type=1400 audit(1720623628.941:114): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe//kmod" pid=1858 comm="apparmor_parser"
[  210.793993] audit: type=1400 audit(1720623657.953:137): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=2440 comm="apparmor_parser"
[  210.793998] audit: type=1400 audit(1720623657.953:138): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe//kmod" pid=2440 comm="apparmor_parser"
[  340.825361] audit: type=1400 audit(1720623787.980:162): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=4103 comm="apparmor_parser"
[  340.825368] audit: type=1400 audit(1720623787.980:163): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe//kmod" pid=4103 comm="apparmor_parser"
[  631.337720] audit: type=1400 audit(1720624078.493:186): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=18203 comm="apparmor_parser"
[  631.337728] audit: type=1400 audit(1720624078.493:187): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe//kmod" pid=18203 comm="apparmor_parser"
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
```

In the end it loads, but I get lots of:

```
time=2024-07-15T03:40:01.397Z level=DEBUG source=sched.go:348 msg="context for request finished"
time=2024-07-15T03:40:01.398Z level=DEBUG source=sched.go:281 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-87d5b13e5157d3a67f8e10a46d8a846ec2b68c1f731e3dfe1546a585432b8fa0 duration=5m0s
time=2024-07-15T03:40:01.398Z level=DEBUG source=sched.go:299 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-87d5b13e5157d3a67f8e10a46d8a846ec2b68c1f731e3dfe1546a585432b8fa0 refCount=0
time=2024-07-15T03:40:09.124Z level=DEBUG source=gpu.go:333 msg="updating system memory data" before.total="251.6 GiB" before.free="240.9 GiB" now.total="251.6 GiB" now.free="239.9 GiB"
time=2024-07-15T03:40:09.273Z level=DEBUG source=gpu.go:374 msg="updating cuda memory data" gpu=GPU-875fb951-07e8-0173-63ca-3926ddbd69de name="NVIDIA GeForce RTX 3090 Ti" before.total="23.7 GiB" before.free="23.4 GiB" now.total="23.7 GiB" now.free="23.4 GiB" now.used="266.9 MiB"
time=2024-07-15T03:40:09.372Z level=DEBUG source=gpu.go:374 msg="updating cuda memory data" gpu=GPU-831da45a-c458-4027-02e2-c35737c26225 name="NVIDIA GeForce RTX 3090" before.total="23.7 GiB" before.free="23.4 GiB" now.total="23.7 GiB" now.free="8.5 GiB" now.used="15.2 GiB"
time=2024-07-15T03:40:09.493Z level=DEBUG source=gpu.go:374 msg="updating cuda memory data" gpu=GPU-0a469c40-39b4-37e0-3229-5ff659d33432 name="NVIDIA GeForce RTX 3090" before.total="23.7 GiB" before.free="23.4 GiB" now.total="23.7 GiB" now.free="23.4 GiB" now.used="260.9 MiB"
time=2024-07-15T03:40:09.516Z level=DEBUG source=sched.go:429 msg="gpu reported" gpu=GPU-875fb951-07e8-0173-63ca-3926ddbd69de library=cuda available="23.4 GiB"
time=2024-07-15T03:40:09.516Z level=INFO source=sched.go:440 msg="updated VRAM based on existing loaded models" gpu=GPU-875fb951-07e8-0173-63ca-3926ddbd69de library=cuda total="23.7 GiB" available="23.4 GiB"
time=2024-07-15T03:40:09.516Z level=DEBUG source=sched.go:429 msg="gpu reported" gpu=GPU-831da45a-c458-4027-02e2-c35737c26225 library=cuda available="8.5 GiB"
time=2024-07-15T03:40:09.516Z level=INFO source=sched.go:440 msg="updated VRAM based on existing loaded models" gpu=GPU-831da45a-c458-4027-02e2-c35737c26225 library=cuda total="23.7 GiB" available="8.5 GiB"
time=2024-07-15T03:40:09.516Z level=DEBUG source=sched.go:429 msg="gpu reported" gpu=GPU-0a469c40-39b4-37e0-3229-5ff659d33432 library=cuda available="23.4 GiB"
time=2024-07-15T03:40:09.516Z level=INFO source=sched.go:440 msg="updated VRAM based on existing loaded models" gpu=GPU-0a469c40-39b4-37e0-3229-5ff659d33432 library=cuda total="23.7 GiB" available="23.4 GiB"
time=2024-07-15T03:40:09.516Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[23.4 GiB]"
time=2024-07-15T03:40:09.517Z level=DEBUG source=sched.go:628 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-f2296999531d6120801529a45b1d103f7370c5970be939ebfc2ba5d0833e9e1e gpu=GPU-0a469c40-39b4-37e0-3229-5ff659d33432 available=25157238784 required="18.0 GiB"
time=2024-07-15T03:40:09.517Z level=DEBUG source=sched.go:191 msg="new model fits with existing models, loading"
time=2024-07-15T03:40:09.517Z level=DEBUG source=server.go:98 msg="system memory" total="251.6 GiB" free=257536163840
time=2024-07-15T03:40:09.517Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[23.4 GiB]"
time=2024-07-15T03:40:09.517Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[23.4 GiB]" memory.required.full="18.0 GiB" memory.required.partial="18.0 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[18.0 GiB]" memory.weights.total="15.0 GiB" memory.weights.repeating="14.0 GiB" memory.weights.nonrepeating="1002.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu_avx/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu_avx2/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cuda_v11/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/rocm_v60101/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu_avx/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu_avx2/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cuda_v11/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/rocm_v60101/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama4250454338/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-f2296999531d6120801529a45b1d103f7370c5970be939ebfc2ba5d0833e9e1e --ctx-size 16384 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 2 --port 45355"
time=2024-07-15T03:40:09.518Z level=DEBUG source=server.go:383 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4250454338/runners/cuda_v11:/tmp/ollama4250454338/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-0a469c40-39b4-37e0-3229-5ff659d33432]"
time=2024-07-15T03:40:09.518Z level=INFO source=sched.go:382 msg="loaded runners" count=2
time=2024-07-15T03:40:09.518Z level=INFO source=server.go:556 msg="waiting for llama runner to start responding"
time=2024-07-15T03:40:09.519Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
time=2024-07-15T03:40:09.770Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
time=2024-07-15T03:40:11.227Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
time=2024-07-15T03:40:11.478Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
time=2024-07-15T03:40:11.478Z level=DEBUG source=server.go:605 msg="model load progress 0.22"
time=2024-07-15T03:40:11.730Z level=DEBUG source=server.go:605 msg="model load progress 0.37"
time=2024-07-15T03:40:11.981Z level=DEBUG source=server.go:605 msg="model load progress 0.53"
time=2024-07-15T03:40:12.233Z level=DEBUG source=server.go:605 msg="model load progress 0.69"
time=2024-07-15T03:40:12.484Z level=DEBUG source=server.go:605 msg="model load progress 0.84"
time=2024-07-15T03:40:12.935Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
time=2024-07-15T03:40:13.186Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
time=2024-07-15T03:40:13.186Z level=DEBUG source=server.go:605 msg="model load progress 1.00"
time=2024-07-15T03:40:13.437Z level=DEBUG source=server.go:608 msg="model load completed, waiting for server to become available" status="llm server loading model"
time=2024-07-15T03:40:13.940Z level=INFO source=server.go:599 msg="llama runner started in 4.42 seconds"
time=2024-07-15T03:40:13.940Z level=DEBUG source=sched.go:395 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-f2296999531d6120801529a45b1d103f7370c5970be939ebfc2ba5d0833e9e1e
```

@infrabrew commented on GitHub (Jan 12, 2025):

You could try `ollama run ollama-model-name-here < /dev/null`.
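
For an API-based variant of the same warm-up: calling the generate endpoint with no prompt loads the model, and `keep_alive` controls how long it stays resident afterwards. A sketch only; the model name is an example and the host/port should match your setup:

```sh
# Preload the model into memory/VRAM and keep it loaded indefinitely.
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'

# Or keep it loaded for a fixed window, e.g. 30 minutes.
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": "30m"}'
```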

@nour-s commented on GitHub (Oct 19, 2025):

@M0wLaue
Is this way of mounting the volume considered the same as using a local folder such as ./ollama?

```yaml
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    networks:
      - net
    restart: unless-stopped
    ports:
      - 11434:11434
    volumes:
      - ollama_storage:/root/.ollama # ===> is it the same as ./ollama since I'm defining the volumes below?
    mem_limit: 15g
    environment:
      - OLLAMA_DEBUG=true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]


volumes:
  ollama_storage:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: ~/docker_services/data/ollama_storage
```
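
As far as I can tell, yes: with `driver_opts` of `type: none`, `o: bind` and a `device` path, the named volume is effectively a bind mount to that host directory, so it should behave like `./ollama` (including the slow path if that directory ultimately lives on the Windows side of WSL2). A plain named volume with no `driver_opts` is stored under Docker's own data root instead. One way to check, as a rough sketch once the stack is up:

```sh
# With the bind-style driver_opts above, Mountpoint is the host directory itself;
# a plain named volume would show a path under Docker's data root instead.
# Compose usually prefixes the volume name with the project name, so check `docker volume ls` first.
docker volume inspect <project>_ollama_storage --format '{{ .Mountpoint }}'
```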

@Darshan2104 commented on GitHub (Oct 20, 2025):

One solution: use llama.cpp directly to run the model locally. It's much faster, and Ollama is just a wrapper around it anyway!

@pdevine commented on GitHub (Oct 20, 2025):

@Darshan2104 that hasn't been true for a while. It's unclear to me from reading the comments whether there is a different issue here.

@Pranaviee commented on GitHub (Dec 26, 2025):

Has anyone found a solution for this?
The same thing happens for me when loading llama3 with Ollama on a GPU for the first query.
How can I preload the model quickly?

@thomas-meier85 commented on GitHub (Jan 4, 2026):

Same here.
I'm on an RTX 6000 Max-Q, and even small models take up to 60 seconds for the initial load:
`time=2026-01-04T20:30:12.894Z level=INFO source=server.go:1376 msg="llama runner started in 49.88 seconds"`

However, the actual request is then fast, as expected.
Is anybody else experiencing such long load times?

Interesting fact: two other servers using A40 GPUs start up quickly, on the same Ollama version.

@Ryderjj89 commented on GitHub (Feb 18, 2026):

Noticing this issue here too. With qwen3:1.7b on a Quadro RTX 4000, it took almost 48 seconds to load. That is really not good.

Reference: github-starred/ollama#63052