[GH-ISSUE #1413] OOM Error on Bad CUDA Driver


Originally created by @farhanhubble on GitHub (Dec 7, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1413

Ollama version: 0.1.1
Reproduction:

  • nvidia-smi (see the pre-flight check sketch after this list)

    Failed to initialize NVML: Driver/library version mismatch
    NVML library version: 535.129
    
  • Run server (see the launcher sketch after this list)

    IP='0.0.0.0'
    PORT='11434'
    EXE='bin/ollama'
    ARGS='serve'
    ENV="OLLAMA_HOST=$IP:$PORT'"
    CMD="$ENV $EXE $ARGS"
    echo Running $CMD
    eval $CMD
    
  • Try embedding a slightly long payload (see the error-handling sketch after this list)

    import requests
    response = requests.post('http://localhost:11434/api/embeddings', json={
        'model': 'llama2:latest',
        'prompt': 'Here is an article about llamas...'*30
    })
    
  • Error

    CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml/ggml-cuda.cu:4856: out of memory
    
  • Logs

    {"timestamp":1701923173,"level":"INFO","function":"main","line":1192,"message":"system info","n_threads":32,"total_threads":64,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | "}
    llama.cpp: loading model from [REDACTED]
    llama_model_load_internal: format     = ggjt v3 (latest)
    llama_model_load_internal: n_vocab    = 32000
    llama_model_load_internal: n_ctx      = 2048
    llama_model_load_internal: n_embd     = 4096
    llama_model_load_internal: n_mult     = 256
    llama_model_load_internal: n_head     = 32
    llama_model_load_internal: n_head_kv  = 32
    llama_model_load_internal: n_layer    = 32
    llama_model_load_internal: n_rot      = 128
    llama_model_load_internal: n_gqa      = 1
    llama_model_load_internal: rnorm_eps  = 5.0e-06
    llama_model_load_internal: n_ff       = 11008
    llama_model_load_internal: freq_base  = 10000.0
    llama_model_load_internal: freq_scale = 1
    llama_model_load_internal: ftype      = 2 (mostly Q4_0)
    llama_model_load_internal: model size = 7B
    llama_model_load_internal: ggml ctx size =    0.08 MB
    llama_model_load_internal: using CUDA for GPU acceleration
    llama_model_load_internal: mem required  = 4013.73 MB (+ 1024.00 MB per state)
    llama_model_load_internal: offloading 0 repeating layers to GPU
    llama_model_load_internal: offloaded 0/35 layers to GPU
    llama_model_load_internal: total VRAM used: 384 MB
    llama_new_context_with_model: kv self size  = 1024.00 MB
    
  • Fix: Hide the GPU with CUDA_VISIBLE_DEVICES='' so llama.cpp falls back to CPU inference (see the last sketch below)
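
A minimal pre-flight check for the nvidia-smi step, sketched in Python. It only shells out to nvidia-smi and looks for the mismatch message; the /proc/driver/nvidia/version path is a Linux driver convention, independent of Ollama:

    import subprocess

    def nvml_mismatch() -> bool:
        # nvidia-smi prints "Failed to initialize NVML: Driver/library
        # version mismatch" when the loaded kernel module and the userspace
        # NVML library disagree, as in this report.
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        return "Driver/library version mismatch" in result.stdout + result.stderr

    if nvml_mismatch():
        # Kernel-side driver version, for comparison with the 535.129 NVML
        # library version that nvidia-smi printed.
        with open("/proc/driver/nvidia/version") as f:
            print(f.readline().strip())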
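
Building a command string and eval-ing it is fragile around quoting. A sketch of the same launch from Python, passing the bind address through the environment instead; the bin/ollama path and port are taken from the report:

    import os
    import subprocess

    # OLLAMA_HOST controls the address the server binds to.
    env = dict(os.environ, OLLAMA_HOST="0.0.0.0:11434")
    server = subprocess.Popen(["bin/ollama", "serve"], env=env)
    server.wait()  # stay in the foreground, like the shell launcher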
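
For the embedding step, checking the HTTP status in the client surfaces the backend failure instead of leaving it only in the server log. Exactly what the server returns when the CUDA call fails depends on the build, so this is only a sketch:

    import requests

    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={
            "model": "llama2:latest",
            "prompt": "Here is an article about llamas..." * 30,
        },
        timeout=120,
    )
    response.raise_for_status()  # raises if the backend hit the CUDA OOM
    embedding = response.json()["embedding"]  # /api/embeddings returns one vector
    print(len(embedding))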
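
The workaround works because an empty CUDA_VISIBLE_DEVICES leaves the CUDA runtime with no visible devices, so llama.cpp falls back to CPU inference. Applied to the launcher sketch above:

    import os
    import subprocess

    # Hiding every GPU sidesteps the broken driver until it can be
    # reinstalled or the host rebooted.
    env = dict(os.environ, OLLAMA_HOST="0.0.0.0:11434", CUDA_VISIBLE_DEVICES="")
    server = subprocess.Popen(["bin/ollama", "serve"], env=env)
    server.wait()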
