[GH-ISSUE #6364] docker container can't detect Nvidia GPU - intermittent "cuda driver library failed to get device context 801" #3995

Open
opened 2026-04-12 14:51:52 -05:00 by GiteaMirror · 37 comments

Originally created by @fahadshery on GitHub (Aug 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6364

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

OS: Ubuntu 24.04 LTS
GPU: Nvidia Tesla P40 (24G)

I installed ollama without docker and it was able to utilise my gpu without any issues.
I then deployed ollama using the following docker compose file:

  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    environment:
      - PUID=${PUID:-1000}
      - PGID=${PGID:-1000}
      - OLLAMA_KEEP_ALIVE=24h
      - ENABLE_IMAGE_GENERATION=True
      - COMFYUI_BASE_URL=http://stable-diffusion-webui:7860
    networks:
      - traefik
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - ./ollama:/root/.ollama
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.ollama.rule=Host(`ollama.local.example.com`)"
      - "traefik.http.routers.ollama.entrypoints=https"
      - "traefik.http.routers.ollama.tls=true"
      - "traefik.http.routers.ollama.tls.certresolver=cloudflare"
      - "traefik.http.routers.ollama.middlewares=default-headers@file"
      - "traefik.http.routers.ollama.middlewares=ollama-auth"
      - "traefik.http.services.ollama.loadbalancer.server.port=11434"
      - "traefik.http.routers.ollama.middlewares=auth"
      - "traefik.http.middlewares.auth.basicauth.users=${OLLAMA_API_CREDENTIALS}"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

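A minimal sketch of how the GPU reservation above can be sanity-checked from the host (assuming the container name ollama from the compose file; these are generic docker CLI commands, not ollama-specific):

```
# Confirm the driver is reachable from inside the running container
docker exec -it ollama nvidia-smi

# Check whether ollama's own GPU discovery succeeded
docker logs ollama 2>&1 | grep -iE "looking for compatible GPUs|no compatible GPUs|cuda"
```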
When I exec into the container and run nvidia-smi, it executes successfully from within the ollama docker container.
But the logs show that it can't detect my GPU:

2024/08/14 22:50:17 routes.go:1123: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-14T22:50:18.674+01:00 level=INFO source=images.go:782 msg="total blobs: 5"
time=2024-08-14T22:50:18.675+01:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-14T22:50:18.677+01:00 level=INFO source=routes.go:1170 msg="Listening on [::]:11434 (version 0.3.5)"
time=2024-08-14T22:50:18.678+01:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2940291930/runners
time=2024-08-14T22:50:30.626+01:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60102]"
time=2024-08-14T22:50:30.626+01:00 level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-14T22:50:30.640+01:00 level=INFO source=gpu.go:260 msg="error looking up nvidia GPU memory" error="cuda driver library failed to get device context 801"
time=2024-08-14T22:50:30.640+01:00 level=INFO source=gpu.go:350 msg="no compatible GPUs were discovered"
time=2024-08-14T22:50:30.640+01:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="47.1 GiB" available="43.9 GiB"
2024/08/14 22:54:19 routes.go:1123: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-14T22:54:19.967+01:00 level=INFO source=images.go:782 msg="total blobs: 5"
time=2024-08-14T22:54:20.012+01:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-14T22:54:20.013+01:00 level=INFO source=routes.go:1170 msg="Listening on [::]:11434 (version 0.3.5)"
time=2024-08-14T22:54:20.032+01:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1278819119/runners

not sure why??

OS

Linux, Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.3.5

GiteaMirror added the docker, nvidia, needs more info, bug labels 2026-04-12 14:51:53 -05:00

@fahadshery commented on GitHub (Aug 14, 2024):

[image: https://github.com/user-attachments/assets/628dbe8e-4ad5-49b9-89b9-c853cbb6d053]

@rick-github commented on GitHub (Aug 14, 2024):

What's the output of nvidia-smi outside of the container?


@rick-github commented on GitHub (Aug 14, 2024):

If you (temporarily) install ollama as a service (curl -fsSL https://ollama.com/install.sh | sh) can it access the GPU?

I see that you've already done that.


@fahadshery commented on GitHub (Aug 14, 2024):

> What's the output of nvidia-smi outside of the container?

[image: https://github.com/user-attachments/assets/7a61e040-c0bc-4bd9-a9c3-83da2ecf1614]

@fahadshery commented on GitHub (Aug 14, 2024):

I am using a vGPU on a datacenter-grade GPU. I changed and tried different Nvidia vGPU profiles, but to no avail. Here is more info on the GPU:

fahadshery@ai-stack:~$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Wed Aug 14 23:19:54 2024
Driver Version                            : 535.161.08
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : GRID P40-24Q
    Product Brand                         : NVIDIA RTX Virtual Workstation
    Product Architecture                  : Pascal
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : N/A
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-a126170b-5a87-11ef-bdda-59b6da7f9cd4
    Minor Number                          : 0
    VBIOS Version                         : 00.00.00.00.00
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 1B38-895-A1
    FRU Part Number                       : N/A
    Module ID                             : N/A
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : VGPU
        Host VGPU Mode                    : N/A
    vGPU Software Licensed Product
        Product Name                      : NVIDIA RTX Virtual Workstation
        License Status                    : Licensed (Expiry: 2024-11-12 21:54:2 GMT)
    GPU Reset Status
        Reset Required                    : N/A
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1B3810DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x11EF10DE
        GPU Link Info
            PCIe Generation
                Max                       : N/A
                Current                   : N/A
                Device Current            : N/A
                Device Max                : N/A
                Host Max                  : N/A
            Link Width
                Max                       : N/A
                Current                   : N/A
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : N/A
        Replay Number Rollovers           : N/A
        Tx Throughput                     : N/A
        Rx Throughput                     : N/A
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons                  : N/A
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 24576 MiB
        Reserved                          : 1680 MiB
        Used                              : 4176 MiB
        Free                              : 18719 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 0 MiB
        Free                              : 256 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : N/A
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : N/A
        GPU Slowdown Temp                 : N/A
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1303 MHz
        SM                                : 1303 MHz
        Memory                            : 3615 MHz
        Video                             : 1164 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2201
            Type                          : C
            Name                          : python
            Used GPU Memory               : 146 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 3576
            Type                          : C
            Name                          : /tmp/ollama1278819119/runners/cuda_v11/ollama_llama_server
            Used GPU Memory               : 4030 MiB

@fahadshery commented on GitHub (Aug 14, 2024):

> If you (temporarily) install ollama as a service (curl -fsSL https://ollama.com/install.sh | sh) can it access the GPU?
>
> I see that you've already done that.

Yes, already tried, and it works beautifully. But I need it running in docker so that it's easier to deploy other services alongside it, like stable diffusion, open-webui, whisper, searxng, libretranslate, etc.


@rick-github commented on GitHub (Aug 14, 2024):

nvidia-smi outside of the container shows an ollama runner using the GPU. Is that running inside the container or is the ollama-as-a-service still running?


@fahadshery commented on GitHub (Aug 14, 2024):

> nvidia-smi outside of the container shows an ollama runner using the GPU. Is that running inside the container or is the ollama-as-a-service still running?

There is no ollama service running in this VM. It's a fresh VM and I deploy everything using ansible so that I don't mess things up. So I am assuming that it's got to be from inside the container. But as the logs show, the container fails to recognise that there is a GPU available to it.


@fahadshery commented on GitHub (Aug 14, 2024):

this is what I am trying to build:

https://technotim.live/posts/ai-stack-tutorial/


@rick-github commented on GitHub (Aug 14, 2024):

What do the following show:
pstree -ls 3576
ps wwp3576


@fahadshery commented on GitHub (Aug 15, 2024):

> What do the following show: pstree -ls 3576 ps wwp3576

OK, I don't know what happened (I didn't make any change other than downloading a different model, i.e. llama3.1:8b).

I ran it and the GPU utilisation went up to 85%.

Then I did the process check, and here are the results:

fahadshery@ai-stack:~$ pstree -ls 27163
systemd───containerd-shim───ollama───ollama_llama_se───15*[{ollama_llama_se}]
fahadshery@ai-stack:~$ ps wwp27163
    PID TTY      STAT   TIME COMMAND
  27163 ?        Sl     0:35 /tmp/ollama3783253186/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --numa distribute --parallel 4 --port 40425

Is that normal? I was expecting ~90% GPU utilisation.


@fahadshery commented on GitHub (Aug 15, 2024):

[image: https://github.com/user-attachments/assets/3bfdc1d7-6f7a-4754-951f-519fcf8e0e58]

@rick-github commented on GitHub (Aug 15, 2024):

GPU utilization will vary with the efficiency of the model and external factors like power usage, etc. I don't know how this applies to vGPUs, as they are a shared resource and presumably there will be some competition for cycles. You can get a view of possible limiting factors by looking at the performance state and throttle reasons from nvidia-smi -q -d POWER,TEMPERATURE,PERFORMANCE.


@fahadshery commented on GitHub (Aug 15, 2024):

> GPU utilization will vary with the efficiency of the model and external factors like power usage, etc. I don't know how this applies to vGPUs, as they are a shared resource and presumably there will be some competition for cycles. You can get a view of possible limiting factors by looking at the performance state and throttle reasons from nvidia-smi -q -d POWER,TEMPERATURE,PERFORMANCE.

How does it come up with --n-gpu-layers 33 in the ps wwp27163 output? How do you determine that? Or is it inherent to the underlying model to decide?


@rick-github commented on GitHub (Aug 15, 2024):

In the server log there will be lines like:

ollama  | time=2024-08-14T22:59:28.178Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=14 layers.split="" memory.available="[11.6 GiB]" memory.required.full="55.9 GiB" memory.required.partial="11.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[11.4 GiB]" memory.weights.total="52.9 GiB" memory.weights.repeating="52.1 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"

This is ollama figuring out how much (V)RAM your system has, and calculating how many layers will fit in the available VRAM and how much RAM will be needed for the non-GPU layers. You can control the number of layers that are offloaded to the GPU with the num_gpu option, either in the CLI (/set parameter num_gpu xx) or in the API (curl localhost:11434/api/generate -d '{"model":"yy","options":{"num_gpu":xx}}').


@fahadshery commented on GitHub (Aug 16, 2024):

So GPU use within the docker container is intermittent. It struggles to reload the model into the GPU once it's been offloaded. I installed using the linux shell script and it's working as expected. In docker it sometimes doesn't even see the GPU, even though the nvidia-smi command works fine within the container.


@rick-github commented on GitHub (Aug 16, 2024):

What's in the server logs when it fails?


@TomorrowToday commented on GitHub (Aug 26, 2024):

Are you running with the nvidia container toolkit? It's not supported yet on Ubuntu 24.04 according to their docs.

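A quick way to check whether the NVIDIA Container Toolkit is actually installed and registered with Docker on the host (a sketch assuming Debian/Ubuntu packaging; the package and CLI names may differ on other distros):

```
# Toolkit CLI present?
nvidia-ctk --version

# Debian/Ubuntu package installed?
dpkg -l | grep -i nvidia-container-toolkit

# Is the nvidia runtime registered with the Docker daemon?
docker info --format '{{json .Runtimes}}'
```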

@fahadshery commented on GitHub (Aug 27, 2024):

> Are you running with the nvidia container toolkit? It's not supported yet on Ubuntu 24.04 according to their docs.

It's working fine in other containers like Stable-Diffusion-webui, whisper, etc.


@superwolfboy commented on GitHub (Aug 31, 2024):

Have you enabled "above 4G" in the BIOS already?


@fahadshery commented on GitHub (Aug 31, 2024):

> Have you enabled "above 4G" in the BIOS already?

I am running it on a Dell R720 server with an NVIDIA Tesla P40 24G GPU, so I'm not sure if there is such an option there. But as I said, all the other containers are working fine. Even the gpu-jupyter container is working fine!


@superwolfboy commented on GitHub (Sep 2, 2024):

My problem is the same as yours: vGPU is not working, with almost the same log.
But GPU passthrough does work, and then only one VM can use the GPU.


@dhiltgen commented on GitHub (Sep 4, 2024):

@fahadshery in your initial logs I see the following error

time=2024-08-14T22:50:30.640+01:00 level=INFO source=gpu.go:260 msg="error looking up nvidia GPU memory" error="cuda driver library failed to get device context 801"

That error code maps to:

    /**
     * This error indicates that the attempted operation is not supported
     * on the current system or device.
     */
    CUDA_ERROR_NOT_SUPPORTED                  = 801,

I would recommend working through our troubleshooting guide for NVIDIA GPUs - https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#nvidia-gpu-discovery

In particular, the uvm driver may be unloading which may explain intermittent behavior (works sometimes, fails other times). In our install script we mitigate this with the following code https://github.com/ollama/ollama/blob/main/scripts/install.sh#L358-L367 which may be applicable for your host system if it turns out this is the root cause.

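A rough sketch of that kind of mitigation, assuming the root cause really is the nvidia_uvm module being unloaded (this mirrors the idea in the install script rather than copying it verbatim):

```
# Check whether the UVM module is currently loaded
lsmod | grep nvidia_uvm

# Load it manually for the current boot
sudo modprobe nvidia_uvm

# Optionally make it load on every boot
echo nvidia_uvm | sudo tee /etc/modules-load.d/nvidia-uvm.conf
```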

@JavierCCC commented on GitHub (Sep 12, 2024):

Check /etc/docker/daemon.json

You want to have a runtime definition related to nvidia inside it, something like this

(...)

    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
(...)

Then, you need to use

runtime: nvidia

Inside your docker compose yaml.

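If the runtime entry is missing, the NVIDIA Container Toolkit can generate it; a hedged sketch, assuming nvidia-ctk is already installed on the host:

```
# Write the nvidia runtime entry into /etc/docker/daemon.json and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify the runtime is now registered
docker info | grep -iA2 runtimes
```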

@mrk3786 commented on GitHub (Sep 16, 2024):

> > nvidia-smi outside of the container shows an ollama runner using the GPU. Is that running inside the container or is the ollama-as-a-service still running?
>
> There is no ollama service running in this VM. It's a fresh VM and I deploy everything using ansible so that I don't mess things up. So I am assuming that it's got to be from inside the container. But as the logs show, the container fails to recognise that there is a GPU available to it.

I ran into the same problem. It turned out to be a CPU type configuration in my proxmox VM. I configured x86 and when I changed that to 'host', the issue was solved.

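For anyone who prefers the Proxmox CLI over the GUI, a hedged sketch of the same change (100 is a placeholder VM ID):

```
# Set the VM's CPU type to "host" (replace 100 with your VM ID), then restart the VM
qm set 100 --cpu host
qm shutdown 100 && qm start 100
```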

@vaclcer commented on GitHub (Sep 17, 2024):

Hello, reporting the same problem with error "cuda driver library failed to get device context 801":

time=2024-09-17T05:56:32.395Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu cpu_avx]"
time=2024-09-17T05:56:32.395Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-09-17T05:56:32.395Z level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-09-17T05:56:32.395Z level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
time=2024-09-17T05:56:32.395Z level=DEBUG source=gpu.go:86 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-09-17T05:56:32.395Z level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
time=2024-09-17T05:56:32.395Z level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[/usr/lib/ollama/libcuda.so* /usr/local/nvidia/lib/libcuda.so* /usr/local/nvidia/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-09-17T05:56:32.396Z level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.525.105.17]
CUDA driver version: 12.0
time=2024-09-17T05:56:32.404Z level=DEBUG source=gpu.go:119 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.525.105.17
time=2024-09-17T05:56:32.404Z level=INFO source=gpu.go:252 msg="error looking up nvidia GPU memory" error="cuda driver library failed to get device context 801"
time=2024-09-17T05:56:32.404Z level=DEBUG source=amd_linux.go:371 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-09-17T05:56:32.404Z level=INFO source=gpu.go:347 msg="no compatible GPUs were discovered"
releasing cuda driver library
time=2024-09-17T05:56:32.404Z level=INFO source=types.go:107 msg="inference compute" id=0 library=cpu variant=avx2 compute="" driver=0.0 name="" total="94.3 GiB" available="92.7 GiB"

nvidia-smi in the container works OK:

Tue Sep 17 06:03:42 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40-48Q On | 00000000:00:05.0 Off | 0 |
| N/A N/A P8 N/A / N/A | 0MiB / 49152MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

I did go through the proposed troubleshooting, but no luck, and no other errors found. What do I do now? Thanks for any help.


@phonkd commented on GitHub (Sep 19, 2024):

After changing the CPU type of my qemu VM to host, it worked.


@fahadshery commented on GitHub (Sep 20, 2024):

> nvidia-smi outside of the container shows an ollama runner using the GPU. Is that running inside the container or is the ollama-as-a-service still running?

Ollama works fine, but it's intermittent. I have no issues with other containers using the GPU. Therefore, I am running ollama as a service and have no issues.

> I ran into the same problem. It turned out to be a CPU type configuration in my proxmox VM. I configured x86 and when I changed that to 'host', the issue was solved.

I will check the CPU type and change it to host, but this might not help since we're using vGPU.


@fahadshery commented on GitHub (Sep 20, 2024):

> After changing the CPU type of my qemu VM to host, it worked.

Are there no model reloading issues?


@dhiltgen commented on GitHub (Sep 24, 2024):

@vaclcer your driver version 525.105.17 is well over a year old (https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-525-105-17/index.html). Perhaps try upgrading the driver and see if maybe this is a bug nvidia has already fixed?

@fahadshery from your logs, you're running a newer driver, but given the intermittent nature of this, it might also be worth trying to upgrade to the latest driver to see if that clears it up.


@fahadshery commented on GitHub (Sep 26, 2024):

> @fahadshery from your logs, you're running a newer driver, but given the intermittent nature of this, it might also be worth trying to upgrade to the latest driver to see if that clears it up.

OK, I am going to upgrade the drivers to the latest 550.90.05, try again, and report back.


@dstaicova commented on GitHub (Oct 1, 2024):

Just to say, I'm getting the same error without docker:

OLLAMA_DEBUG=1 ollama serve
time=2024-10-01T18:59:04.108+03:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-10-01T18:59:04.109+03:00 level=DEBUG source=gpu.go:86 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-10-01T18:59:04.109+03:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
time=2024-10-01T18:59:04.109+03:00 level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /home/denijane/scripts/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-10-01T18:59:04.166+03:00 level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths="[/usr/lib/libcuda.so.550.107.02 /usr/lib/libcuda.so.550.78 /usr/lib.bak/libcuda.so.550.90.07 /usr/lib32/libcuda.so.550.107.02 /usr/lib64/libcuda.so.550.107.02 /usr/lib64/libcuda.so.550.78]"
cuInit err: 999
time=2024-10-01T18:59:04.173+03:00 level=WARN source=gpu.go:562 msg="unknown error initializing cuda driver library" library=/usr/lib/libcuda.so.550.107.02 error="cuda driver library init failure: 999"
time=2024-10-01T18:59:04.173+03:00 level=WARN source=gpu.go:563 msg="see https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md for more information"
cuInit err: 803
time=2024-10-01T18:59:04.181+03:00 level=WARN source=gpu.go:558 msg="version mismatch between driver and cuda driver library - reboot or upgrade may be required" library=/usr/lib/libcuda.so.550.78 error="cuda driver library init failure: 803"
cuInit err: 803
time=2024-10-01T18:59:04.193+03:00 level=WARN source=gpu.go:558 msg="version mismatch between driver and cuda driver library - reboot or upgrade may be required" library=/usr/lib.bak/libcuda.so.550.90.07 error="cuda driver library init failure: 803"
library /usr/lib32/libcuda.so.550.107.02 load err: /usr/lib32/libcuda.so.550.107.02: wrong ELF class: ELFCLASS32
time=2024-10-01T18:59:04.193+03:00 level=DEBUG source=gpu.go:566 msg="skipping 32bit library" library=/usr/lib32/libcuda.so.550.107.02
cuInit err: 999
time=2024-10-01T18:59:04.202+03:00 level=WARN source=gpu.go:562 msg="unknown error initializing cuda driver library" library=/usr/lib64/libcuda.so.550.107.02 error="cuda driver library init failure: 999"
time=2024-10-01T18:59:04.202+03:00 level=WARN source=gpu.go:563 msg="see https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md for more information"
cuInit err: 803
time=2024-10-01T18:59:04.205+03:00 level=WARN source=gpu.go:558 msg="version mismatch between driver and cuda driver library - reboot or upgrade may be required" library=/usr/lib64/libcuda.so.550.78 error="cuda driver library init failure: 803"
I'm on Manjaro, with nvidia 550.107.02 and cuda 12.4. I just installed the newest ollama and noticed it starts the models really slowly, slower than before. The nvidia driver is working (I can see games in nvidia-smi), but I'm not sure when I last saw ollama use the GPU, as things have been pretty busy for the past 2-3 weeks.


@dhiltgen commented on GitHub (Oct 17, 2024):

@denijane you're getting 2 different errors from 2 different libraries we try. The 999 error is a generic "unknown error" code, which isn't super helpful; however, the other code, 803, is enlightening.

    /**
     * This error indicates that there is a mismatch between the versions of
     * the display driver and the CUDA driver. Refer to the compatibility documentation
     * for supported versions.
     */
    CUDA_ERROR_SYSTEM_DRIVER_MISMATCH         = 803,

If you have already rebooted, somehow your system has gotten into an inconsistent state where the driver you're booting doesn't match the libraries installed.
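
One way to check for such a mismatch (a sketch assuming a typical Linux driver install; library paths vary by distro) is to compare the version of the loaded kernel module against the CUDA driver libraries on disk:

    # Version of the NVIDIA kernel module that is currently loaded
    cat /proc/driver/nvidia/version
    # Driver version as reported through the driver interface
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    # User-space CUDA driver libraries on disk; a stale or leftover
    # libcuda.so.* that doesn't match the loaded module can produce error 803
    ls -l /usr/lib*/libcuda.so.* /usr/lib/*-linux-gnu/libcuda.so.* 2>/dev/null

If the versions disagree, removing the stale library copies (or reinstalling the driver) and rebooting should bring them back in sync.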

@MrHongping commented on GitHub (Oct 19, 2024):

I encountered the same problem. My GPU is a Tesla P4, presented to an Ubuntu 22 virtual machine as a PVE (Proxmox VE) virtualized GPU. nvidia-smi shows the virtualized GPU is working properly, but I am unable to get GPU acceleration whether I start ollama in Docker or directly in the virtual machine.

root@gpu-server:~# nvidia-smi
Sat Oct 19 12:17:19 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID P4-8A                     On  | 00000000:00:10.0 Off |                  N/A |
| N/A   N/A    P8              N/A /  N/A |      0MiB /  8192MiB |      0%   Prohibited |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+


root@gpu-server:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

root@gpu-server:~# ollama serve
2024/10/19 11:59:03 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-10-19T11:59:03.005Z level=INFO source=images.go:754 msg="total blobs: 5"
time=2024-10-19T11:59:03.005Z level=INFO source=images.go:761 msg="total unused blobs removed: 0"
time=2024-10-19T11:59:03.005Z level=INFO source=routes.go:1205 msg="Listening on 127.0.0.1:11434 (version 0.3.13)"
time=2024-10-19T11:59:03.006Z level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2843459739/runners
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cpu payload=linux/amd64/cpu/libggml.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cpu payload=linux/amd64/cpu/libllama.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cpu payload=linux/amd64/cpu/ollama_llama_server.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx payload=linux/amd64/cpu_avx/libggml.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx payload=linux/amd64/cpu_avx/libllama.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx payload=linux/amd64/cpu_avx/ollama_llama_server.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx2 payload=linux/amd64/cpu_avx2/libggml.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx2 payload=linux/amd64/cpu_avx2/libllama.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx2 payload=linux/amd64/cpu_avx2/ollama_llama_server.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v11 payload=linux/amd64/cuda_v11/libggml.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v11 payload=linux/amd64/cuda_v11/libllama.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v11 payload=linux/amd64/cuda_v11/ollama_llama_server.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v12 payload=linux/amd64/cuda_v12/libggml.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v12 payload=linux/amd64/cuda_v12/libllama.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v12 payload=linux/amd64/cuda_v12/ollama_llama_server.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=rocm_v60102 payload=linux/amd64/rocm_v60102/libggml.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=rocm_v60102 payload=linux/amd64/rocm_v60102/libllama.so.gz
time=2024-10-19T11:59:03.006Z level=DEBUG source=common.go:168 msg=extracting runner=rocm_v60102 payload=linux/amd64/rocm_v60102/ollama_llama_server.gz
time=2024-10-19T11:59:14.631Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama2843459739/runners/cpu/ollama_llama_server
time=2024-10-19T11:59:14.631Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama2843459739/runners/cpu_avx/ollama_llama_server
time=2024-10-19T11:59:14.631Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama2843459739/runners/cpu_avx2/ollama_llama_server
time=2024-10-19T11:59:14.631Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama2843459739/runners/cuda_v11/ollama_llama_server
time=2024-10-19T11:59:14.631Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama2843459739/runners/cuda_v12/ollama_llama_server
time=2024-10-19T11:59:14.631Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama2843459739/runners/rocm_v60102/ollama_llama_server
time=2024-10-19T11:59:14.631Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v11 cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2]"
time=2024-10-19T11:59:14.631Z level=DEBUG source=common.go:50 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-10-19T11:59:14.631Z level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-10-19T11:59:14.631Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-10-19T11:59:14.631Z level=DEBUG source=gpu.go:86 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-10-19T11:59:14.631Z level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
time=2024-10-19T11:59:14.631Z level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /root/libcuda.so* /usr/local/cuda-12.2/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-10-19T11:59:14.634Z level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths="[/usr/lib/x86_64-linux-gnu/libcuda.so.535.161.07 /usr/lib32/libcuda.so.535.161.07]"
CUDA driver version: 12.2
time=2024-10-19T11:59:14.640Z level=DEBUG source=gpu.go:118 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.535.161.07
time=2024-10-19T11:59:14.640Z level=INFO source=gpu.go:252 msg="error looking up nvidia GPU memory" error="cuda driver library failed to get device context 801"
time=2024-10-19T11:59:14.640Z level=DEBUG source=amd_linux.go:376 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-10-19T11:59:14.640Z level=INFO source=gpu.go:347 msg="no compatible GPUs were discovered"
releasing cuda driver library
time=2024-10-19T11:59:14.640Z level=INFO source=types.go:107 msg="inference compute" id=0 library=cpu variant=avx2 compute="" driver=0.0 name="" total="7.8 GiB" available="7.2 GiB"
time=2024-10-19T12:02:44.961Z level=DEBUG source=sched.go:318 msg="shutting down scheduler completed loop"
time=2024-10-19T12:02:44.961Z level=DEBUG source=common.go:73 msg="cleaning up" dir=/tmp/ollama2843459739
time=2024-10-19T12:02:44.961Z level=DEBUG source=sched.go:119 msg="shutting down scheduler pending loop"

@dhiltgen commented on GitHub (Nov 6, 2024):

I've posted a new PR documenting a workaround that some users have had success with for a slightly different failure mode; it might be helpful in these cases as well. If you are experiencing the sporadic 801, please give it a try and let us know whether it resolves the problem.

#7519
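
One workaround that users have reported for sporadic NVIDIA GPU discovery failures on Linux (for example after suspend/resume) is reloading the NVIDIA UVM kernel module; whether this matches what #7519 documents should be confirmed against the PR itself. A minimal sketch:

    # Reload the NVIDIA UVM kernel module; only do this when
    # nothing is actively using the GPU
    sudo rmmod nvidia_uvm
    sudo modprobe nvidia_uvm
    # Then restart ollama (for the Docker case, restart the container, e.g.)
    docker restart ollama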

@DoctorDream commented on GitHub (Feb 24, 2025):

> I've posted a new PR documenting a workaround that some users have had success with for a slightly different failure mode; it might be helpful in these cases as well. If you are experiencing the sporadic 801, please give it a try and let us know whether it resolves the problem.
>
> #7519

After some research, it appears this problem mostly occurs in virtual machines. Through numerous attempts, I've found that the cause of this problem (cuda driver library failed to get device context 801) may not be the graphics card itself, but rather the CPU instruction set exposed to the VM.
The VM's default virtual CPU type doesn't expose the AVX2 instruction set (which can be checked via lscpu). After switching the VM's CPU type to host, AVX2 appears and ollama runs normally.
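
A minimal sketch of that check and change, assuming the hypervisor is Proxmox VE (the qm CLI) and using a hypothetical VM ID of 100:

    # Inside the VM: see whether AVX2 is exposed to the guest
    lscpu | grep -o avx2 | head -n 1
    # On the Proxmox host: switch the VM's CPU type to "host" so the guest
    # sees the physical CPU's feature flags (including AVX2)
    qm set 100 --cpu host
    # Stop and start the VM, then re-run lscpu inside the guest to confirm

The same change can also be made in the Proxmox web UI under the VM's Hardware > Processors settings.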

@tharun571 commented on GitHub (Jan 20, 2026):

The "cuda driver library failed to get device context 801" error in Docker containers is frustrating - especially when GPU works fine outside Docker!

Common root causes for Docker GPU detection issues:

  1. nvidia-container-runtime not configured: Check if docker info | grep -i runtime shows nvidia
  2. Missing --gpus flag: Ensure you're using --gpus all or the runtime config in docker-compose
  3. Driver/container toolkit version mismatch: Host driver must support container's CUDA version
  4. Device permissions: Container might not have access to /dev/nvidia* devices

Quick validation steps:

# Inside container, check if GPU is visible:
nvidia-smi
# Check CUDA device access:
ls -la /dev/nvidia*
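
A hedged end-to-end version of those checks from the host side (the CUDA image tag below is only an example; pick one your host driver supports):

    # Confirm Docker is configured with the nvidia runtime
    docker info | grep -i runtime
    # Throwaway container test: if this prints the nvidia-smi table,
    # GPU forwarding into containers works in general
    docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi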

For intermittent failures (works sometimes, fails others), it's often a race condition with device initialization.

I built an OSS diagnostic tool for these exact GPU+Docker issues: env-doctor (https://github.com/mitulgarg/env-doctor)

It checks:

  • Docker GPU runtime configuration
  • Host driver vs container CUDA compatibility
  • GPU device forwarding and permissions
  • Detects missing nvidia-container-toolkit setup

It works both inside and outside containers to pinpoint where the GPU detection breaks.

Full disclosure: I'm the author. Sharing because Docker GPU issues are notoriously hard to debug, and this automates the validation chain. Hope it helps troubleshoot!

Reference: github-starred/ollama#3995