[GH-ISSUE #11841] The runner process fails to pick up GPUs with SLURM sbatch or srun with Singularity #69917

Closed
opened 2026-05-04 19:46:54 -05:00 by GiteaMirror · 0 comments

Originally created by @hwang2006 on GitHub (Aug 10, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11841

What I observed (symptoms)
Under srun/sbatch + Singularity, Ollama would start and detect the A100 (“inference compute… library=cuda”), but when the runner spawned it would load:

load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so

…and never show “loaded CUDA backend / using device CUDA0”.
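A quick way to tell the two cases apart in the server log (a hedged example; the patterns are taken from the lines quoted above, and the path is this job's log file):

    grep -E 'load_backend|using device' /scratch/qualis/deepseek/ollama_server_540347.log
    # CPU-only run: load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
    # healthy run:  load_backend: loaded CUDA backend ... followed by ... using device CUDA0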

As a result, generation ran CPU-only, so Gradio either sat “waiting” forever or felt painfully slow.

When I ran the same script on a login node for testing purposes, the runner did pick up CUDA (the log showed “using device CUDA0 … runner started”).
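For comparison, here is a minimal sketch of the launch pattern I would expect to engage the GPU under SLURM (hypothetical, not the attached script; the image path is a placeholder). The two ingredients are Singularity's --nv flag, which binds the host NVIDIA driver libraries into the container, and forwarding the SLURM-assigned CUDA_VISIBLE_DEVICES past the container's environment scrubbing via the SINGULARITYENV_ prefix:

    #!/bin/bash
    #SBATCH --partition=amd_a100nv_8
    #SBATCH --gres=gpu:1
    IMAGE=/path/to/ollama.sif   # placeholder; not the real image path

    # Sanity check: is the allocated GPU visible inside the container at all?
    singularity exec --nv "$IMAGE" nvidia-smi -L

    # Without --nv the CUDA backend typically cannot initialize inside the
    # container, and ggml falls back to the CPU backend as observed above.
    SINGULARITYENV_CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES}" \
        singularity exec --nv "$IMAGE" ollama serve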

$ srun -p amd_a100nv_8 --comment=pytorch --gres=gpu:1 ./ollama_gradio_run.sh
srun: job 540347 queued and waiting for resources
srun: job 540347 has been allocated resources

========================================
Starting Ollama + Gradio
Date: Sun Aug 10 18:24:57 KST 2025
Server: gpu34
SLURM Job ID: 540347
Gradio Port (requested): 7860
Ollama Port: 11434
Default Model: 0
========================================

load module-environment
🔍 Python / GPU:
/scratch/qualis/miniconda3/envs/deepseek/bin/python
Python executable: /scratch/qualis/miniconda3/envs/deepseek/bin/python
fastapi 0.116.1
gradio 5.41.1
gradio_client 1.11.0
uvicorn 0.35.0
NVIDIA A100-SXM4-80GB, 81920, 81038
Using CUDA toolkit at: /apps/cuda/12.1
🚀 Starting Ollama server…
Ollama PID: 125982
✅ Ollama API is up!
📋 Available models:

  • mistral:7b (7.2B, Q4_K_M)
  • qwen3:8b (8.2B, Q4_K_M)
  • gpt-oss:latest (20.9B, MXFP4)
  • gpt-oss:120b (116.8B, MXFP4)
  • tinyllama:latest (1B, Q4_0)
  • phi3:latest (3.8B, Q4_0)
  • gemma:latest (9B, Q4_0)
  • llama3:latest (8.0B, Q4_0)
🌐 Starting Gradio web interface...
Gradio PID: 126331
⏳ Waiting for Gradio UI at http://127.0.0.1:7860/ ...
... still waiting (10s)
... still waiting (20s)
... still waiting (30s)
... still waiting (40s)
... still waiting (50s)
... still waiting (60s)
... still waiting (70s)
... still waiting (80s)
... still waiting (90s)
... still waiting (100s)
... still waiting (110s)
... still waiting (120s)
... still waiting (130s)
... still waiting (140s)
... still waiting (150s)
... still waiting (160s)
... still waiting (170s)
... still waiting (180s)
... still waiting (190s)
... still waiting (200s)
... still waiting (210s)
... still waiting (220s)
... still waiting (230s)
... still waiting (240s)
... still waiting (250s)
... still waiting (260s)
... still waiting (270s)
... still waiting (280s)
... still waiting (290s)
... still waiting (300s)
... still waiting (310s)
... still waiting (320s)
... still waiting (330s)
... still waiting (340s)
... still waiting (350s)
... still waiting (360s)
... still waiting (370s)
... still waiting (380s)
... still waiting (390s)
... still waiting (400s)
... still waiting (410s)
... still waiting (420s)
... still waiting (430s)
... still waiting (440s)
... still waiting (450s)
... still waiting (460s)
... still waiting (470s)
... still waiting (480s)
... still waiting (490s)
... still waiting (500s)
... still waiting (510s)
... still waiting (520s)
... still waiting (530s)
... still waiting (540s)
... still waiting (550s)
✅ Gradio UI is up!
=========================================
🎉 All services started successfully!
Gradio URL: http://gpu34:7860
Local access (tunnel): http://localhost:7860 → use:
ssh -N -L 7860:gpu34:7860 -L 11434:gpu34:11434 qualis@neuron.ksc.re.kr
Ollama API: http://gpu34:11434
Logs:
Ollama: /scratch/qualis/deepseek/ollama_server_540347.log
Gradio: /scratch/qualis/deepseek/gradio_server_540347.log
=========================================
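The ~560 seconds of “still waiting” above likely reflects the CPU-only slowdown described earlier rather than a Gradio fault. For reference, a health-check loop like the following would produce exactly that output (a guessed reconstruction, not the attached ollama_gradio_run.sh; the variable names are invented):

    URL="http://127.0.0.1:7860/"
    for ((t = 10; t <= 600; t += 10)); do
        sleep 10
        # curl -sf stays silent and exits non-zero until the UI responds
        if curl -sf -o /dev/null "$URL"; then
            echo "✅ Gradio UI is up!"
            break
        fi
        echo "... still waiting (${t}s)"
    done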

Attachments:
ollama_server_540347.log: https://github.com/user-attachments/files/21704738/ollama_server_540347.log
ollama_gradio_run.sh.txt: https://github.com/user-attachments/files/21704730/ollama_gradio_run.sh.txt
