[GH-ISSUE #1769] Long initial loading time. #63052

Closed
opened 2026-05-03 11:35:54 -05:00 by GiteaMirror · 15 comments

Originally created by @themw123 on GitHub (Jan 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1769

It takes a few minutes, and sometimes the model never starts at all when I try to run it. The problem occurs with all the models I am using, even small ones like tinyllama. Once the model has finally loaded after a few minutes, everything works fine and I get fast chat responses.

I am using Windows with WSL2 and Docker Desktop. Ollama is installed in WSL2, and the models are also placed there by bind-mounting them into the WSL2 file system via Docker volumes.

@pdevine commented on GitHub (Jan 4, 2024):

The long load time is because the model is being loaded into memory when you start the REPL. I'm guessing the problem is related to Docker Desktop's I/O speed. You can confirm this by timing a copy of a large file (even tinyllama is > 600 MB) inside the Docker volume.
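
For reference, a minimal way to get that throughput number without copying anything is to read an existing model blob back out with `dd`, which prints an MB/s figure when it finishes. This is only a sketch: the container name `ollama` is an assumption, and the blob file name must be replaced with one that actually exists in your volume:

```sh
# List the model blobs stored in the volume.
docker exec -it ollama ls -lh /root/.ollama/models/blobs

# Read one blob end-to-end; dd reports the effective read throughput when it finishes.
# (Repeat runs will be faster because of the page cache; the first run is the telling one.)
docker exec -it ollama dd if=/root/.ollama/models/blobs/sha256-<blob-id> of=/dev/null bs=1M
```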

@zioalex commented on GitHub (Mar 25, 2024):

I am in the same situation. Did you find any way to improve Docker I/O performance in this case?

@themw123 commented on GitHub (Mar 25, 2024):

No, the slow loading is due to WSL. I am now using Ollama on native Windows. The loading time has improved, but it is still not very fast; I would say it takes about half as long as before with WSL.

@zioalex commented on GitHub (Mar 26, 2024):

I see. Unfortunately I cannot install it natively on Windows. Still searching for a way to optimise WSL + Docker + Ollama.

@M0wLaue commented on GitHub (May 7, 2024):

Try this compose.yml with `docker compose up -d`:

```yaml
services:
  ollama:
    container_name: ollama
    image: ollama/ollama:latest
    volumes:
#      - ./ollama:/root/.ollama # this solution synchronizes with the real harddrive and is slow af
      - ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
    ports:
      - 11434:11434
    networks:
      - llm-network
    environment:
      - gpus=all
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

networks:
  llm-network:
    driver: bridge

volumes:
    ollama:
```
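
A quick way to check what this changes, and to reuse models that were already downloaded into a bind-mounted `./ollama` directory, is sketched below. The commands are assumptions based on Docker Desktop defaults; Compose usually prefixes named volumes with the project name, so substitute the actual volume name from `docker volume ls`:

```sh
# Find the actual volume name (Compose typically creates <project>_ollama).
docker volume ls

# Show where the named volume lives: inside Docker's own Linux filesystem,
# which is why reads from it are much faster than from a Windows-side bind mount.
docker volume inspect <project>_ollama --format '{{ .Mountpoint }}'

# One-off copy of previously downloaded models from a bind-mounted ./ollama
# directory into the named volume, using a throwaway container.
docker run --rm -v "$(pwd)/ollama:/from" -v <project>_ollama:/to alpine cp -a /from/. /to/
```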

@Chukarslan commented on GitHub (May 23, 2024):

Did someone find a solution for this? I am running Ollama on AWS EC2 (tried a range of g5 and g4 instances), and it seems it is always around 17 seconds for Mistral (3.7 GB), which is extremely slow. Screenshot from a g5.12xlarge instance with 4 GPUs:

![g5x12](https://github.com/ollama/ollama/assets/108550191/fa677004-51ac-4b8d-bd25-db47452cb491)

@Darshan2104 commented on GitHub (Jul 11, 2024):

I am using the codellama model on my local machine, and only the very first query takes longer than expected.
What could be the reason? Will it work fine after a few initial queries?

@LuisMalhadas commented on GitHub (Jul 15, 2024):

Here is another report:
Trying to load llama3:70b onto three RTX 3090s takes me around half an hour to an hour.
I am currently running it in Docker:

```
docker run -d --gpus '"device=0,2,3"' -v ollama:/root/.ollama -v .../ollama:/model -p 11434:11434 -e OLLAMA_HOST=0.0.0.0 -e OLLAMA_ORIGINS=* -e OLLAMA_MAX_LOADED_MODELS=2 -e OLLAMA_NUM_PARALLEL=2 -e OLLAMA_DEBUG=1 -e CUDA_ERROR_LEVEL=50 --name ollama2 ollama/ollama
```
```
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
```
```
Mon Jul 15 08:01:44 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off |   00000000:04:00.0 Off |                  Off |
| 35%   61C    P2            215W /  450W |   17597MiB /  24564MiB |     27%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:43:00.0 Off |                  N/A |
| 55%   66C    P2            213W /  350W |   17293MiB /  24576MiB |     31%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off |   00000000:88:00.0 Off |                  N/A |
| 78%   58C    P2            241W /  420W |   17293MiB /  24576MiB |     37%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off |   00000000:C4:00.0 Off |                  N/A |
|  0%   36C    P8             21W /  350W |   10499MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2236599      C   ...unners/cuda_v11/ollama_llama_server      17590MiB |
|    1   N/A  N/A   2236599      C   ...unners/cuda_v11/ollama_llama_server      17286MiB |
|    2   N/A  N/A   2236599      C   ...unners/cuda_v11/ollama_llama_server      17286MiB |
|    3   N/A  N/A     19625      C   /app/.venv/bin/python                       10492MiB |
+-----------------------------------------------------------------------------------------+
```
```
sudo dmesg | grep -i nvrm
[sudo] password for djfil: 
[    5.645486] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.90.07  Fri May 31 09:35:42 UTC 2024
[20922.713866] NVRM: GPU at PCI:0000:88:00: GPU-0a469c40-39b4-37e0-3229-5ff659d33432
[20922.713885] NVRM: Xid (PCI:0000:88:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[20922.713893] NVRM: GPU 0000:88:00.0: GPU has fallen off the bus.
[20922.713907] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
djfil@antonio:~$ sudo dmesg | grep -i nvidia
[    5.401223] nvidia: loading out-of-tree module taints kernel.
[    5.401286] nvidia: module license 'NVIDIA' taints kernel.
[    5.432111] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.453959] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[    5.457233] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.508309] nvidia 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.551675] nvidia 0000:88:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.599148] nvidia 0000:c4:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.645486] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.90.07  Fri May 31 09:35:42 UTC 2024
[    5.666118] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.90.07  Fri May 31 09:30:47 UTC 2024
[    5.670384] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[    6.798471] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 1
[    6.799105] [drm] [nvidia-drm] [GPU ID 0x00004300] Loading driver
[    7.768095] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:43:00.0 on minor 2
[    7.774306] [drm] [nvidia-drm] [GPU ID 0x00008800] Loading driver
[   10.529242] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:88:00.0 on minor 3
[   10.534219] [drm] [nvidia-drm] [GPU ID 0x0000c400] Loading driver
[   11.551526] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c4:00.0 on minor 4
[   25.182630] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[   25.188515] nvidia-uvm: Loaded the UVM driver, major device number 506.
[   29.637075] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input8
[   29.637208] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input9
[   29.637326] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input10
[   29.637466] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input11
[   29.637565] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input12
[   29.637666] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input13
[   29.637765] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:80/0000:80:03.1/0000:86:00.0/0000:87:00.0/0000:88:00.1/sound/card2/input14
[   29.637928] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input15
[   29.640361] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input1
[   29.641554] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input16
[   29.649335] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input17
[   29.650121] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input2
[   29.654941] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input18
[   29.660677] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input19
[   29.667755] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input3
[   29.676380] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input20
[   29.692111] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input4
[   29.698743] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input5
[   29.699110] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0/0000:c2:01.0/0000:c4:00.1/sound/card3/input21
[   29.707294] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input6
[   29.715194] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.1/sound/card1/input7
[   32.513871] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input22
[   32.514018] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input23
[   32.514146] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input24
[   32.514302] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input25
[   32.514441] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input26
[   32.514575] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input27
[   32.514720] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:03.1/0000:01:00.0/0000:02:01.0/0000:04:00.1/sound/card0/input28
[   34.566728] audit: type=1400 audit(1720623481.265:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1246 comm="apparmor_parser"
[   34.566737] audit: type=1400 audit(1720623481.265:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1246 comm="apparmor_parser"
[  181.782373] audit: type=1400 audit(1720623628.941:113): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=1858 comm="apparmor_parser"
[  181.782381] audit: type=1400 audit(1720623628.941:114): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe//kmod" pid=1858 comm="apparmor_parser"
[  210.793993] audit: type=1400 audit(1720623657.953:137): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=2440 comm="apparmor_parser"
[  210.793998] audit: type=1400 audit(1720623657.953:138): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe//kmod" pid=2440 comm="apparmor_parser"
[  340.825361] audit: type=1400 audit(1720623787.980:162): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=4103 comm="apparmor_parser"
[  340.825368] audit: type=1400 audit(1720623787.980:163): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe//kmod" pid=4103 comm="apparmor_parser"
[  631.337720] audit: type=1400 audit(1720624078.493:186): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=18203 comm="apparmor_parser"
[  631.337728] audit: type=1400 audit(1720624078.493:187): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe//kmod" pid=18203 comm="apparmor_parser"
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
```

In the end it loads, but I get lots of:

```
time=2024-07-15T03:40:01.397Z level=DEBUG source=sched.go:348 msg="context for request finished"
time=2024-07-15T03:40:01.398Z level=DEBUG source=sched.go:281 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-87d5b13e5157d3a67f8e10a46d8a846ec2b68c1f731e3dfe1546a585432b8fa0 duration=5m0s
time=2024-07-15T03:40:01.398Z level=DEBUG source=sched.go:299 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-87d5b13e5157d3a67f8e10a46d8a846ec2b68c1f731e3dfe1546a585432b8fa0 refCount=0
time=2024-07-15T03:40:09.124Z level=DEBUG source=gpu.go:333 msg="updating system memory data" before.total="251.6 GiB" before.free="240.9 GiB" now.total="251.6 GiB" now.free="239.9 GiB"
time=2024-07-15T03:40:09.273Z level=DEBUG source=gpu.go:374 msg="updating cuda memory data" gpu=GPU-875fb951-07e8-0173-63ca-3926ddbd69de name="NVIDIA GeForce RTX 3090 Ti" before.total="23.7 GiB" before.free="23.4 GiB" now.total="23.7 GiB" now.free="23.4 GiB" now.used="266.9 MiB"
time=2024-07-15T03:40:09.372Z level=DEBUG source=gpu.go:374 msg="updating cuda memory data" gpu=GPU-831da45a-c458-4027-02e2-c35737c26225 name="NVIDIA GeForce RTX 3090" before.total="23.7 GiB" before.free="23.4 GiB" now.total="23.7 GiB" now.free="8.5 GiB" now.used="15.2 GiB"
time=2024-07-15T03:40:09.493Z level=DEBUG source=gpu.go:374 msg="updating cuda memory data" gpu=GPU-0a469c40-39b4-37e0-3229-5ff659d33432 name="NVIDIA GeForce RTX 3090" before.total="23.7 GiB" before.free="23.4 GiB" now.total="23.7 GiB" now.free="23.4 GiB" now.used="260.9 MiB"
time=2024-07-15T03:40:09.516Z level=DEBUG source=sched.go:429 msg="gpu reported" gpu=GPU-875fb951-07e8-0173-63ca-3926ddbd69de library=cuda available="23.4 GiB"
time=2024-07-15T03:40:09.516Z level=INFO source=sched.go:440 msg="updated VRAM based on existing loaded models" gpu=GPU-875fb951-07e8-0173-63ca-3926ddbd69de library=cuda total="23.7 GiB" available="23.4 GiB"
time=2024-07-15T03:40:09.516Z level=DEBUG source=sched.go:429 msg="gpu reported" gpu=GPU-831da45a-c458-4027-02e2-c35737c26225 library=cuda available="8.5 GiB"
time=2024-07-15T03:40:09.516Z level=INFO source=sched.go:440 msg="updated VRAM based on existing loaded models" gpu=GPU-831da45a-c458-4027-02e2-c35737c26225 library=cuda total="23.7 GiB" available="8.5 GiB"
time=2024-07-15T03:40:09.516Z level=DEBUG source=sched.go:429 msg="gpu reported" gpu=GPU-0a469c40-39b4-37e0-3229-5ff659d33432 library=cuda available="23.4 GiB"
time=2024-07-15T03:40:09.516Z level=INFO source=sched.go:440 msg="updated VRAM based on existing loaded models" gpu=GPU-0a469c40-39b4-37e0-3229-5ff659d33432 library=cuda total="23.7 GiB" available="23.4 GiB"
time=2024-07-15T03:40:09.516Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[23.4 GiB]"
time=2024-07-15T03:40:09.517Z level=DEBUG source=sched.go:628 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-f2296999531d6120801529a45b1d103f7370c5970be939ebfc2ba5d0833e9e1e gpu=GPU-0a469c40-39b4-37e0-3229-5ff659d33432 available=25157238784 required="18.0 GiB"
time=2024-07-15T03:40:09.517Z level=DEBUG source=sched.go:191 msg="new model fits with existing models, loading"
time=2024-07-15T03:40:09.517Z level=DEBUG source=server.go:98 msg="system memory" total="251.6 GiB" free=257536163840
time=2024-07-15T03:40:09.517Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[23.4 GiB]"
time=2024-07-15T03:40:09.517Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[23.4 GiB]" memory.required.full="18.0 GiB" memory.required.partial="18.0 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[18.0 GiB]" memory.weights.total="15.0 GiB" memory.weights.repeating="14.0 GiB" memory.weights.nonrepeating="1002.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu_avx/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu_avx2/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cuda_v11/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/rocm_v60101/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu_avx/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cpu_avx2/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/cuda_v11/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4250454338/runners/rocm_v60101/ollama_llama_server
time=2024-07-15T03:40:09.518Z level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama4250454338/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-f2296999531d6120801529a45b1d103f7370c5970be939ebfc2ba5d0833e9e1e --ctx-size 16384 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 2 --port 45355"
time=2024-07-15T03:40:09.518Z level=DEBUG source=server.go:383 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4250454338/runners/cuda_v11:/tmp/ollama4250454338/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-0a469c40-39b4-37e0-3229-5ff659d33432]"
time=2024-07-15T03:40:09.518Z level=INFO source=sched.go:382 msg="loaded runners" count=2
time=2024-07-15T03:40:09.518Z level=INFO source=server.go:556 msg="waiting for llama runner to start responding"
time=2024-07-15T03:40:09.519Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
time=2024-07-15T03:40:09.770Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
time=2024-07-15T03:40:11.227Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
time=2024-07-15T03:40:11.478Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
time=2024-07-15T03:40:11.478Z level=DEBUG source=server.go:605 msg="model load progress 0.22"
time=2024-07-15T03:40:11.730Z level=DEBUG source=server.go:605 msg="model load progress 0.37"
time=2024-07-15T03:40:11.981Z level=DEBUG source=server.go:605 msg="model load progress 0.53"
time=2024-07-15T03:40:12.233Z level=DEBUG source=server.go:605 msg="model load progress 0.69"
time=2024-07-15T03:40:12.484Z level=DEBUG source=server.go:605 msg="model load progress 0.84"
time=2024-07-15T03:40:12.935Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
time=2024-07-15T03:40:13.186Z level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
time=2024-07-15T03:40:13.186Z level=DEBUG source=server.go:605 msg="model load progress 1.00"
time=2024-07-15T03:40:13.437Z level=DEBUG source=server.go:608 msg="model load completed, waiting for server to become available" status="llm server loading model"
time=2024-07-15T03:40:13.940Z level=INFO source=server.go:599 msg="llama runner started in 4.42 seconds"
time=2024-07-15T03:40:13.940Z level=DEBUG source=sched.go:395 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-f2296999531d6120801529a45b1d103f7370c5970be939ebfc2ba5d0833e9e1e
```

@infrabrew commented on GitHub (Jan 12, 2025):

You could try `ollama run ollama-model-name-here < /dev/null`.
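
For an API-based variant of the same warm-up: calling the generate endpoint with no prompt loads the model, and `keep_alive` controls how long it stays resident afterwards. A sketch only; the model name is an example and the host/port should match your setup:

```sh
# Preload the model into memory/VRAM and keep it loaded indefinitely.
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'

# Or keep it loaded for a fixed window, e.g. 30 minutes.
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": "30m"}'
```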

@nour-s commented on GitHub (Oct 19, 2025):

@M0wLaue
Is this way of mounting the volume considered the same as using a local folder such as ./ollama?

```yaml
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    networks:
      - net
    restart: unless-stopped
    ports:
      - 11434:11434
    volumes:
      - ollama_storage:/root/.ollama # ===> is it the same as ./ollama since I'm defining the volumes below?
    mem_limit: 15g
    environment:
      - OLLAMA_DEBUG=true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]


volumes:
  ollama_storage:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: ~/docker_services/data/ollama_storage
```
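
As far as I can tell, yes: with `driver_opts` of `type: none`, `o: bind` and a `device` path, the named volume is effectively a bind mount to that host directory, so it should behave like `./ollama` (including the slow path if that directory ultimately lives on the Windows side of WSL2). A plain named volume with no `driver_opts` is stored under Docker's own data root instead. One way to check, as a rough sketch once the stack is up:

```sh
# With the bind-style driver_opts above, Mountpoint is the host directory itself;
# a plain named volume would show a path under Docker's data root instead.
# Compose usually prefixes the volume name with the project name, so check `docker volume ls` first.
docker volume inspect <project>_ollama_storage --format '{{ .Mountpoint }}'
```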

@Darshan2104 commented on GitHub (Oct 20, 2025):

One solution: use llama.cpp directly to run the model locally. It's much faster, and Ollama is just a wrapper around it anyway!

@pdevine commented on GitHub (Oct 20, 2025):

@Darshan2104 that hasn't been true for a while. It's unclear to me from reading the comments whether there is a different issue here.

@Pranaviee commented on GitHub (Dec 26, 2025):

Has anyone found a solution for this?
The same thing happens for me when loading llama3 with Ollama on a GPU for the first query.
How can I preload the model quickly?

@thomas-meier85 commented on GitHub (Jan 4, 2026):

Same here.
I'm on an RTX 6000 Max-Q, and even small models take up to 60 seconds for the initial load:
`time=2026-01-04T20:30:12.894Z level=INFO source=server.go:1376 msg="llama runner started in 49.88 seconds"`

However, the actual request is then fast, as expected.
Is anybody else experiencing such long load times?

Interesting fact: two other servers using A40 GPUs start up quickly, on the same Ollama version.

@Ryderjj89 commented on GitHub (Feb 18, 2026):

Noticing this issue here too. With qwen3:1.7b on a Quadro RTX 4000, it took almost 48 seconds to load. That is really not good.

Reference: github-starred/ollama#63052