[GH-ISSUE #3267] CUDA Error when changing models #27772

Closed
opened 2026-04-22 05:21:03 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @iamashwin99 on GitHub (Mar 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3267

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I ran queries against Ollama 0.1.29, first with llama2, then nomic-embed-text, and then llama2 again.
On the third model change I get the following CUDA error:

llama_new_context_with_model:      CUDA7 compute buffer size =     3.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     1.50 MiB
llama_new_context_with_model: graph splits (measure): 9
loading library /tmp/ollama126694761/runners/cuda_v11/libext_server.so
{"function":"initialize","level":"INFO","line":440,"msg":"initializing slots","n_slots":1,"tid":"140511969015552","timestamp":1710928189}
{"function":"initialize","level":"INFO","line":449,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"140511969015552","timestamp":1710928189}
time=2024-03-20T10:49:49.921+01:00 level=INFO source=dyn_ext_server.go:162 msg="Starting llama main loop"
{"function":"update_slots","level":"INFO","line":1590,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"140509200766720","timestamp":1710928189}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"140509200766720","timestamp":1710928189}
{"function":"update_slots","level":"INFO","line":1848,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"140509200766720","timestamp":1710928189}
{"function":"update_slots","level":"INFO","line":1652,"msg":"slot released","n_cache_tokens":8,"n_ctx":2048,"n_past":8,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"140509200766720","timestamp":1710928189,"truncated":false}
[GIN] 2024/03/20 - 10:49:50 | 200 |  4.469532267s |    10.254.6.122 | POST     "/api/embeddings"
time=2024-03-20T10:49:50.059+01:00 level=INFO source=routes.go:79 msg="changing loaded model"
CUDA error: invalid argument
  current device: 6, in function ggml_free_cublas at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:12501
  cuMemUnmap(g_cuda_pool_addr[id], g_cuda_pool_size[id])
GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:256: !"CUDA error"
[New LWP 586182]
[New LWP 586183]
[New LWP 586184]

Full log at https://sprunge.us/qACpmh
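
A more verbose server log makes crashes like this easier to triage. A minimal sketch for capturing one, assuming OLLAMA_DEBUG still enables verbose logging as in recent releases:

# Run the server with verbose logging and keep a copy of the output
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama-debug.log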

What did you expect to see?

No errors.

Steps to reproduce

ollama serve
ollama pull nomic-embed-text
ollama pull llama2

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt":"Why is the sky blue?"
 }'

curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The sky is blue because of Rayleigh scattering"
}'

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt":"Why is the sky blue?"
 }'
# Fails here
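
To trigger the crash more reliably, the same two requests can be driven in a loop so the server swaps models on every iteration. A sketch built from the commands above (the count of 5 is arbitrary):

# Alternate generate and embeddings requests, forcing a model
# swap each time; the server should crash within a few rounds
for i in $(seq 1 5); do
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama2",
    "prompt": "Why is the sky blue?"
  }' > /dev/null
  curl -s http://localhost:11434/api/embeddings -d '{
    "model": "nomic-embed-text",
    "prompt": "The sky is blue because of Rayleigh scattering"
  }' > /dev/null
done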

Are there any recent changes that introduced the issue?

No response

OS

Linux

Architecture

amd64

Platform

No response

Ollama version

0.1.29

GPU

No response

GPU info

8x Tesla V100

CPU

Intel

Other software

No response

GiteaMirror added the nvidia and bug labels 2026-04-22 05:21:03 -05:00

@iamashwin99 commented on GitHub (Mar 20, 2024):

Same behavior on v0.1.27 and v0.1.26 as well.


@dhiltgen commented on GitHub (Mar 20, 2024):

This sounds like it may be global state leakage. It's in the class of defects that should be resolved when we merge #3218
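
If the failing call really is per-device global state (the backtrace shows cuMemUnmap on a device-indexed pool, g_cuda_pool_addr[id], while the current device is 6), then hiding all but one GPU from the server may sidestep the bad cleanup path until the fix lands. An untested workaround sketch using NVIDIA's standard CUDA_VISIBLE_DEVICES mechanism:

# Expose only one of the eight V100s so model unload never has
# to tear down pools across multiple devices
CUDA_VISIBLE_DEVICES=0 ollama serve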


@dhiltgen commented on GitHub (Apr 15, 2024):

Please retry with 0.1.32 and if you're still seeing the problem let us know.
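
A quick way to upgrade and confirm the running version, assuming the official Linux install script applies to this setup:

# Reinstall via the official script, then verify the version
curl -fsSL https://ollama.com/install.sh | sh
ollama --version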

Reference: github-starred/ollama#27772