[GH-ISSUE #6471] Issue when running smollm:360m and also smollm:135m #4071

Closed
opened 2026-04-12 14:58:41 -05:00 by GiteaMirror · 5 comments

Originally created by @NEWbie0709 on GitHub (Aug 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6471

What is the issue?

I tried running with the 1.7b version, and it ran successfully.
![image](https://github.com/user-attachments/assets/6074c785-cbb2-43e0-b82d-32fe74184840)
However, when running these two smaller versions, it shows the following error.
![image](https://github.com/user-attachments/assets/419da9f0-0ea2-4795-bdab-78e457dfbd08)
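
For reference, the runs in question were presumably along these lines (the exact tags are an assumption based on the title; per the later comments, any of the small smollm tags reproduces it):

```console
$ ollama run smollm:1.7b hello    # completes normally
$ ollama run smollm:360m hello    # fails with the error in the screenshot
$ ollama run smollm:135m hello    # fails the same way
```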

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.6

GiteaMirror added the bug label 2026-04-12 14:58:41 -05:00

@rick-github commented on GitHub (Aug 23, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may help in debugging, but as a first guess I'd say that your machine doesn't have enough (V)RAM to host the 135m (92MB) or 360m (229MB) models.
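
If low (V)RAM is the suspicion, a minimal sanity check (a sketch, assuming `nvidia-smi` is on the PATH and that this Ollama build includes `ollama ps`) is to compare free GPU memory with what the loaded model actually occupies:

```console
$ nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
$ ollama ps
```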


@NEWbie0709 commented on GitHub (Aug 26, 2024):

Here are the server logs:
2024/08/26 10:31:14 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\Tianyi\.ollama\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\Users\Tianyi\AppData\Local\Programs\Ollama\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-26T10:31:14.730+08:00 level=INFO source=images.go:782 msg="total blobs: 27"
time=2024-08-26T10:31:14.760+08:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-26T10:31:14.762+08:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6)"
time=2024-08-26T10:31:14.764+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [rocm_v6.1 cpu cpu_avx cpu_avx2 cuda_v11.3]"
time=2024-08-26T10:31:14.764+08:00 level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-26T10:31:15.009+08:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-db6c762b-e2a3-42a7-5a9f-7cfef49069d4 library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
[GIN] 2024/08/26 - 10:31:15 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/08/26 - 10:31:15 | 200 | 40.7308ms | 127.0.0.1 | POST "/api/show"
time=2024-08-26T10:31:15.090+08:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Tianyi\.ollama\models\blobs\sha256-eb2c714d40d4b35ba4b8ee98475a06d51d8080a17d2d2a75a23665985c739b94 gpu=GPU-db6c762b-e2a3-42a7-5a9f-7cfef49069d4 parallel=4 available=11707162624 required="895.2 MiB"
time=2024-08-26T10:31:15.090+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=31 layers.offload=31 layers.split="" memory.available="[10.9 GiB]" memory.required.full="895.2 MiB" memory.required.partial="895.2 MiB" memory.required.kv="180.0 MiB" memory.required.allocations="[895.2 MiB]" memory.weights.total="237.1 MiB" memory.weights.repeating="208.4 MiB" memory.weights.nonrepeating="28.7 MiB" memory.graph.full="164.5 MiB" memory.graph.partial="168.4 MiB"
time=2024-08-26T10:31:15.100+08:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\Users\Tianyi\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model C:\Users\Tianyi\.ollama\models\blobs\sha256-eb2c714d40d4b35ba4b8ee98475a06d51d8080a17d2d2a75a23665985c739b94 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 31 --no-mmap --parallel 4 --port 50650"
time=2024-08-26T10:31:15.195+08:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-26T10:31:15.195+08:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-26T10:31:15.196+08:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="8360" timestamp=1724639475
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="8360" timestamp=1724639475 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="50650" tid="8360" timestamp=1724639475
llama_model_loader: loaded meta data with 39 key-value pairs and 272 tensors from C:\Users\Tianyi\.ollama\models\blobs\sha256-eb2c714d40d4b35ba4b8ee98475a06d51d8080a17d2d2a75a23665985c739b94 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = SmolLM 135M
llama_model_loader: - kv 3: general.organization str = HuggingFaceTB
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = SmolLM
llama_model_loader: - kv 6: general.size_label str = 135M
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = SmolLM 135M
llama_model_loader: - kv 10: general.base_model.0.organization str = HuggingFaceTB
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv 12: general.tags arr[str,3] = ["alignment-handbook", "trl", "sft"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: general.datasets arr[str,4] = ["Magpie-Align/Magpie-Pro-300K-Filter...
llama_model_loader: - kv 15: llama.block_count u32 = 30
llama_model_loader: - kv 16: llama.context_length u32 = 2048
llama_model_loader: - kv 17: llama.embedding_length u32 = 576
llama_model_loader: - kv 18: llama.feed_forward_length u32 = 1536
llama_model_loader: - kv 19: llama.attention.head_count u32 = 9
llama_model_loader: - kv 20: llama.attention.head_count_kv u32 = 3
llama_model_loader: - kv 21: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 22: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: general.file_type u32 = 2
llama_model_loader: - kv 24: llama.vocab_size u32 = 49152
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = smollm
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,49152] = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,48900] = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 35: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 37: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - type f32: 61 tensors
llama_model_loader: - type q4_0: 210 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: special tokens cache size = 17
llm_load_vocab: token to piece cache size = 0.3170 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 49152
llm_load_print_meta: n_merges = 48900
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 576
llm_load_print_meta: n_layer = 30
llm_load_print_meta: n_head = 9
llm_load_print_meta: n_head_kv = 3
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 192
llm_load_print_meta: n_embd_v_gqa = 192
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 134.52 M
llm_load_print_meta: model size = 85.77 MiB (5.35 BPW)
llm_load_print_meta: general.name = SmolLM 135M
llm_load_print_meta: BOS token = 1 '<|im_start|>'
llm_load_print_meta: EOS token = 2 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: PAD token = 2 '<|im_end|>'
llm_load_print_meta: LF token = 143 'Ä'
llm_load_print_meta: EOT token = 0 '<|endoftext|>'
llm_load_print_meta: max token length = 162
time=2024-08-26T10:31:15.964+08:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.25 MiB
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 31/31 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 28.69 MiB
llm_load_tensors: CUDA0 buffer size = 85.82 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 180.00 MiB
llama_new_context_with_model: KV self size = 180.00 MiB, K (f16): 90.00 MiB, V (f16): 90.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.76 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 17.13 MiB
llama_new_context_with_model: graph nodes = 966
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="8360" timestamp=1724639477
time=2024-08-26T10:31:17.560+08:00 level=INFO source=server.go:632 msg="llama runner started in 2.36 seconds"
[GIN] 2024/08/26 - 10:31:17 | 200 | 2.4895053s | 127.0.0.1 | POST "/api/chat"
CUDA error: CUBLAS_STATUS_NOT_SUPPORTED
current device: 0, in function ggml_cuda_mul_mat_batched_cublas at C:\a\ollama\ollama\llm\llama.cpp\ggml\src\ggml-cuda.cu:1889
cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
C:\a\ollama\ollama\llm\llama.cpp\ggml\src\ggml-cuda.cu:101: CUDA error
[GIN] 2024/08/26 - 10:31:29 | 200 | 8.9100593s | 127.0.0.1 | POST "/api/chat"


@rick-github commented on GitHub (Aug 27, 2024):

There does appear to be something wrong with the models in ollama; all of the q4, q8, and fp16 quants of the 360m and 135m models fail with the same error.

$ ~/bin/ollama run smollm:360m-instruct-v0.2-fp16 hello
Error: an unknown error was encountered while running the model CUDA error: CUBLAS_STATUS_NOT_SUPPORTED
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:1881
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error

I even downloaded the safetensors from HF and converted them to GGUF with llama.cpp, and the same error occurred.

However, the GGUF works with llama-cli from llama.cpp (or at least it doesn't crash; the output is not great). So this seems like a problem with the llama.cpp backend in ollama.
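
A rough sketch of that cross-check (directory name, output file, and prompt here are placeholders; `convert_hf_to_gguf.py` and `llama-cli` are the stock llama.cpp tools):

```console
# convert the HuggingFaceTB safetensors checkout to an fp16 GGUF
$ python convert_hf_to_gguf.py ./SmolLM-135M-Instruct --outtype f16 --outfile smollm-135m-f16.gguf
# run the same GGUF directly with llama.cpp's CLI, fully offloaded to the GPU
$ llama-cli -m smollm-135m-f16.gguf -p "hello" -n 64 -ngl 99
```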


@rick-github commented on GitHub (Aug 27, 2024):

Upgrading ollama to 0.3.7 makes the 360m models work; the 135m models output more text but still crash.

$ ollama run smollm:135m-instruct-v0.2-fp16 hello
Hello! How can you are welcome. I am so glad to thank you are you are you are you are the most beautiful and i'm a very much more than just like you are you. You're we have a great, but it is an expert in your 
friend.

Error: an unknown error was encountered while running the model CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:2416
  cudaStreamSynchronize(cuda_ctx->stream())
/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
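
One way to confirm the remaining crash is specific to the CUDA runner (a sketch; `OLLAMA_LLM_LIBRARY` and the `cpu_avx2` runner both appear in the server log above) is to force a CPU runner and retry:

```console
# POSIX shell shown for brevity; on Windows, set the variable for the server process and restart it
$ OLLAMA_LLM_LIBRARY=cpu_avx2 ollama serve
# in another terminal
$ ollama run smollm:135m-instruct-v0.2-fp16 hello
```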

@rick-github commented on GitHub (Jan 11, 2026):

No longer an issue in 0.13.5.

$ ollama run smollm:135m-instruct-v0.2-fp16 hello
Hello! How can I help you today?
Reference: github-starred/ollama#4071