[GH-ISSUE #6637] cuda device unavailable error results in failed memory update leading to concurrent model load when no space actually available #4178

Open
opened 2026-04-12 15:06:21 -05:00 by GiteaMirror · 10 comments
Owner

Originally created by @iplayfast on GitHub (Sep 4, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6637

Originally assigned to: @dhiltgen on GitHub.

### What is the issue?

```
ollama run llama3.1 (is ok)
```

switch to a different terminal

```
ollama run yi-coder
Error: llama runner process has terminated: CUDA error
ollama run llama3.1 (is ok)
```

### OS

Linux

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.3.9

GiteaMirror added the nvidia, bug labels 2026-04-12 15:06:21 -05:00

@iplayfast commented on GitHub (Sep 4, 2024):

```
ollama run yi-coder
Error: llama runner process has terminated: CUDA error
chris@FORGE:~/bin$ ollama run llama3.1
>>> hello
Hello! How are you today? Is there something I can help you with or would you like to chat?

>>> /bye
chris@FORGE:~/bin$ ollama run yi-coder
Error: llama runner process has terminated: CUDA error
chris@FORGE:~/bin$ 
```
```
nvidia-smi
Wed Sep  4 13:25:44 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   41C    P8              16W / 450W |   1221MiB / 24564MiB |      5%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1748      G   /usr/lib/xorg/Xorg                          473MiB |
|    0   N/A  N/A      4413      G   cinnamon                                     89MiB |
|    0   N/A  N/A     11862      G   ...seed-version=20240901-180126.474000      436MiB |
|    0   N/A  N/A    177764      G   ...,WinRetrieveSuggestionsOnlyOnDemand      117MiB |
|    0   N/A  N/A    499733      G   ...yOnDemand --variations-seed-version       83MiB |
```
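The nvidia-smi table above shows plenty of free VRAM (1221 MiB used of 24564 MiB), even though the load fails. A minimal sketch for watching free VRAM while reproducing the failure, using the standard `--query-gpu` flags (the script falls back gracefully on machines without NVIDIA tooling):

```shell
#!/bin/sh
# Query free VRAM on the first GPU via nvidia-smi's machine-readable output.
# Assumption: a single-GPU box like the RTX 4090 above; `head -n1` keeps
# only the first GPU's value on multi-GPU systems.
if command -v nvidia-smi >/dev/null 2>&1; then
  free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
else
  free_mib=-1   # no NVIDIA tooling on this machine
fi
echo "free VRAM (MiB): $free_mib"
```

Running this in a loop (`watch -n1 ...`) while switching models can show whether free VRAM actually drops before the CUDA error, or whether the failure happens with memory to spare, as the logs below suggest.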

@rick-github commented on GitHub (Sep 4, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) from the failures will help in debugging.


@jmorganca commented on GitHub (Sep 4, 2024):

As @rick-github mentioned, getting server logs will help us debug this. Sorry you hit an error.
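For a systemd-based Linux install like this one, collecting the requested logs can be sketched as follows (assumption: the service unit is named `ollama`, as in the standard Linux install; adjust `--since` to bracket the failure):

```shell
#!/bin/sh
# Dump the ollama service journal to a file for attaching to the issue.
# Falls back with a hint when journalctl is unavailable (non-systemd hosts).
if command -v journalctl >/dev/null 2>&1; then
  journalctl -u ollama --no-pager --since "2024-09-04 12:00" > ollama-server.log 2>&1 || true
else
  echo "journalctl not found; check the server log location in the troubleshooting doc" > ollama-server.log
fi
wc -l ollama-server.log
```

If the crash is reproducible, restarting the server with `OLLAMA_DEBUG=1` in its environment (per the troubleshooting doc) should add more detail around the CUDA error.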


@iplayfast commented on GitHub (Sep 4, 2024):

Here you go, hope it helps

journalctl -u ollama --no-pager --since="2024-09-04 12:00"
Sep 04 12:52:45 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:52:45 | 200 |       16.69µs |       127.0.0.1 | HEAD     "/"
Sep 04 12:52:45 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:52:45 | 404 |      47.783µs |       127.0.0.1 | POST     "/api/show"
Sep 04 12:52:46 FORGE ollama[841963]: time=2024-09-04T12:52:46.988-04:00 level=INFO source=download.go:175 msg="downloading 8169bd33ad13 in 16 314 MB part(s)"
Sep 04 12:53:53 FORGE ollama[841963]: time=2024-09-04T12:53:53.935-04:00 level=INFO source=download.go:175 msg="downloading a23e6bd35e94 in 1 693 B part(s)"
Sep 04 12:53:55 FORGE ollama[841963]: time=2024-09-04T12:53:55.833-04:00 level=INFO source=download.go:175 msg="downloading 3dc12ee097e8 in 1 135 B part(s)"
Sep 04 12:53:57 FORGE ollama[841963]: time=2024-09-04T12:53:57.755-04:00 level=INFO source=download.go:175 msg="downloading a60ed831ae4c in 1 485 B part(s)"
Sep 04 12:54:01 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:54:01 | 200 |         1m15s |       127.0.0.1 | POST     "/api/pull"
Sep 04 12:54:01 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:54:01 | 200 |   25.060116ms |       127.0.0.1 | POST     "/api/show"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.308-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23707648000 required="6.4 GiB"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.308-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[22.1 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.309-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama2682843892/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 41089"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.310-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.310-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.310-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 12:54:01 FORGE ollama[1441911]: INFO [main] build info | build=1 commit="1e6f655" tid="126918209675264" timestamp=1725468841
Sep 04 12:54:01 FORGE ollama[1441911]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="126918209675264" timestamp=1725468841 total_threads=32
Sep 04 12:54:01 FORGE ollama[1441911]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="41089" tid="126918209675264" timestamp=1725468841
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   2:                               general.name str              = Yi Coder 9B Chat
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   3:                           general.finetune str              = Chat
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   4:                           general.basename str              = Yi-Coder
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   5:                         general.size_label str              = 9B
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   7:                          llama.block_count u32              = 48
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 11008
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 4
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000000.000000
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  15:                          general.file_type u32              = 2
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  16:                           llama.vocab_size u32              = 64000
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  18:            tokenizer.ggml.add_space_prefix bool             = false
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,64000]   = [-1000.000000, -1000.000000, -1000.00...
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,64000]   = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv  30:               general.quantization_version u32              = 2
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - type  f32:   97 tensors
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_vocab: special tokens cache size = 12
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: arch             = llama
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: vocab type       = SPM
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_vocab          = 64000
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_merges         = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: vocab_only       = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd           = 4096
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_layer          = 48
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_head           = 32
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_head_kv        = 4
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_rot            = 128
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_swa            = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_gqa            = 8
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_ff             = 11008
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_expert         = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_expert_used    = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: causal attn      = 1
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: pooling type     = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: rope type        = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: rope scaling     = linear
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: freq_scale_train = 1
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: model type       = 34B
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: model params     = 8.83 B
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: max token length = 48
Sep 04 12:54:01 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 12:54:01 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 12:54:01 FORGE ollama[841963]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 12:54:01 FORGE ollama[841963]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: ggml ctx size =    0.41 MiB
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: offloading 48 repeating layers to GPU
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: offloaded 49/49 layers to GPU
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors:        CPU buffer size =   140.62 MiB
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors:      CUDA0 buffer size =  4661.61 MiB
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.561-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: n_ctx      = 8192
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: n_batch    = 512
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: n_ubatch   = 512
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: flash_attn = 0
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: freq_base  = 10000000.0
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: freq_scale = 1
Sep 04 12:54:02 FORGE ollama[841963]: llama_kv_cache_init:      CUDA0 KV buffer size =   768.00 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model:  CUDA_Host  output buffer size =     1.04 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: graph nodes  = 1542
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: graph splits = 2
Sep 04 12:54:02 FORGE ollama[1441911]: INFO [main] model loaded | tid="126918209675264" timestamp=1725468842
Sep 04 12:54:02 FORGE ollama[841963]: time=2024-09-04T12:54:02.314-04:00 level=INFO source=server.go:630 msg="llama runner started in 1.00 seconds"
Sep 04 12:54:02 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:54:02 | 200 |   1.09955161s |       127.0.0.1 | POST     "/api/chat"
Sep 04 12:56:25 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:56:25 | 200 |  2.112406341s |       127.0.0.1 | POST     "/api/chat"
Sep 04 12:58:05 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:58:05 | 200 |      22.902µs |       127.0.0.1 | HEAD     "/"
Sep 04 12:58:05 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:58:05 | 200 |   10.305179ms |       127.0.0.1 | GET      "/api/tags"
Sep 04 12:59:29 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:59:29 | 200 |  2.165997807s |       127.0.0.1 | POST     "/api/chat"
Sep 04 13:00:44 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:00:44 | 200 |      31.832µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:00:44 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:00:44 | 200 |   16.798501ms |       127.0.0.1 | GET      "/api/tags"
Sep 04 13:01:33 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:01:33 | 200 |      48.597µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:01:33 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:01:33 | 200 |   14.299022ms |       127.0.0.1 | GET      "/api/tags"
Sep 04 13:02:11 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:11 | 200 |      23.888µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:02:11 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:11 | 200 |  124.734063ms |       127.0.0.1 | DELETE   "/api/delete"
Sep 04 13:02:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:18 | 200 |      56.422µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:02:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:18 | 200 |  111.964713ms |       127.0.0.1 | DELETE   "/api/delete"
Sep 04 13:02:31 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:31 | 200 |      68.827µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:02:32 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:32 | 200 |  241.180652ms |       127.0.0.1 | DELETE   "/api/delete"
Sep 04 13:02:34 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:34 | 200 |      16.769µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:02:34 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:34 | 200 |     8.65959ms |       127.0.0.1 | GET      "/api/tags"
Sep 04 13:02:54 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:54 | 200 |      20.896µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:02:54 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:54 | 400 |      41.243µs |       127.0.0.1 | DELETE   "/api/delete"
Sep 04 13:03:09 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:03:09 | 200 |       40.24µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:03:09 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:03:09 | 500 |     133.843µs |       127.0.0.1 | DELETE   "/api/delete"
Sep 04 13:03:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:03:18 | 200 |      35.656µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:03:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:03:18 | 200 |   81.959669ms |       127.0.0.1 | DELETE   "/api/delete"
Sep 04 13:04:29 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:04:29.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:05:07 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:07 | 200 |      27.692µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:05:07 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:07 | 200 |    7.418544ms |       127.0.0.1 | POST     "/api/show"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.994-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23613407232 required="6.4 GiB"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.995-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[22.0 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.997-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama2682843892/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 44427"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.997-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.997-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.997-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:05:08 FORGE ollama[1453465]: INFO [main] build info | build=1 commit="1e6f655" tid="132648159997952" timestamp=1725469508
Sep 04 13:05:08 FORGE ollama[1453465]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="132648159997952" timestamp=1725469508 total_threads=32
Sep 04 13:05:08 FORGE ollama[1453465]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="44427" tid="132648159997952" timestamp=1725469508
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   2:                               general.name str              = Yi Coder 9B Chat
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   3:                           general.finetune str              = Chat
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   4:                           general.basename str              = Yi-Coder
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   5:                         general.size_label str              = 9B
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   7:                          llama.block_count u32              = 48
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 11008
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 4
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000000.000000
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  15:                          general.file_type u32              = 2
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  16:                           llama.vocab_size u32              = 64000
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  18:            tokenizer.ggml.add_space_prefix bool             = false
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,64000]   = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,64000]   = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv  30:               general.quantization_version u32              = 2
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - type  f32:   97 tensors
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: arch             = llama
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: vocab type       = SPM
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_vocab          = 64000
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_merges         = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: vocab_only       = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd           = 4096
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_layer          = 48
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_head           = 32
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_head_kv        = 4
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_rot            = 128
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_swa            = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_gqa            = 8
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_ff             = 11008
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_expert         = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: causal attn      = 1
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: pooling type     = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: rope type        = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: model type       = 34B
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: model params     = 8.83 B
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: max token length = 48
Sep 04 13:05:08 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:05:08 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:05:08 FORGE ollama[841963]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:05:08 FORGE ollama[841963]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: ggml ctx size =    0.41 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: offloading 48 repeating layers to GPU
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: offloaded 49/49 layers to GPU
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors:        CPU buffer size =   140.62 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors:      CUDA0 buffer size =  4661.61 MiB
Sep 04 13:05:08 FORGE ollama[841963]: time=2024-09-04T13:05:08.249-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: n_ctx      = 8192
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: n_batch    = 512
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: n_ubatch   = 512
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: flash_attn = 0
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: freq_base  = 10000000.0
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: freq_scale = 1
Sep 04 13:05:08 FORGE ollama[841963]: llama_kv_cache_init:      CUDA0 KV buffer size =   768.00 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model:  CUDA_Host  output buffer size =     1.04 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: graph nodes  = 1542
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: graph splits = 2
Sep 04 13:05:08 FORGE ollama[1453465]: INFO [main] model loaded | tid="132648159997952" timestamp=1725469508
Sep 04 13:05:09 FORGE ollama[841963]: time=2024-09-04T13:05:09.001-04:00 level=INFO source=server.go:630 msg="llama runner started in 1.00 seconds"
Sep 04 13:05:09 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:09 | 200 |  1.105447205s |       127.0.0.1 | POST     "/api/chat"
Sep 04 13:05:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:18 | 200 |   20.929748ms |       127.0.0.1 | POST     "/api/show"
Sep 04 13:05:23 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:23 | 200 |     8.70556ms |       127.0.0.1 | POST     "/api/show"
Sep 04 13:05:30 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:30 | 200 |   15.351273ms |       127.0.0.1 | POST     "/api/show"
Sep 04 13:06:27 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:06:27 | 200 |  291.413881ms |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:06:28 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:06:28 | 200 |  1.014343573s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:06:53 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:06:53 | 200 |  1.369648167s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:06:54 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:06:54 | 200 |  1.052921831s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:09:48 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:09:48 | 200 |  2.274160768s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:09:50 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:09:50 | 200 |  1.399936035s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:09:51 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:09:51 | 200 |  1.407673273s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:09:53 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:09:53 | 200 |  1.444955083s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:10:47 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:47.793-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.817-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.2 GiB"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.817-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18496550912 required="6.2 GiB"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.817-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[17.2 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.818-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama2682843892/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 32919"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.818-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.818-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.818-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:10:47 FORGE ollama[1460566]: INFO [main] build info | build=1 commit="1e6f655" tid="133121226891264" timestamp=1725469847
Sep 04 13:10:47 FORGE ollama[1460566]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="133121226891264" timestamp=1725469847 total_threads=32
Sep 04 13:10:47 FORGE ollama[1460566]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="32919" tid="133121226891264" timestamp=1725469847
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   5:                         general.size_label str              = 8B
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv   9:                          llama.block_count u32              = 32
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  17:                          general.file_type u32              = 2
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - type  f32:   66 tensors
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - type q4_0:  225 tensors
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 13:10:48 FORGE ollama[841963]: time=2024-09-04T13:10:48.069-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_vocab: special tokens cache size = 256
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: arch             = llama
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: vocab type       = BPE
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_vocab          = 128256
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_merges         = 280147
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: vocab_only       = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd           = 4096
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_layer          = 32
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_head           = 32
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_head_kv        = 8
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_rot            = 128
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_swa            = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_gqa            = 4
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_ff             = 14336
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_expert         = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: causal attn      = 1
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: pooling type     = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: rope type        = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: freq_base_train  = 500000.0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: model type       = 8B
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: model params     = 8.03 B
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: LF token         = 128 'Ä'
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: max token length = 256
Sep 04 13:10:48 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:10:48 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:10:48 FORGE ollama[841963]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:10:48 FORGE ollama[841963]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:10:48 FORGE ollama[841963]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:10:48 FORGE ollama[841963]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:10:48 FORGE ollama[841963]:   cudaMemGetInfo(free, total)
Sep 04 13:10:48 FORGE ollama[841963]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460567]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460568]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460569]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460570]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460571]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460572]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460573]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460574]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460575]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460576]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460577]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460578]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460579]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460580]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460581]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460582]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460583]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460584]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460585]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460586]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460587]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460588]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460589]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460590]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460591]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460592]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460593]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460594]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460595]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460596]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460597]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460598]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460599]
Sep 04 13:10:49 FORGE ollama[841963]: time=2024-09-04T13:10:49.023-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 13:10:49 FORGE ollama[1460601]: [Thread debugging using libthread_db enabled]
Sep 04 13:10:49 FORGE ollama[1460601]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:10:49 FORGE ollama[1460601]: 0x000079124f0ea42f in __GI___wait4 (pid=1460601, stat_loc=0x7ffcb70a19b4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:10:49 FORGE ollama[841963]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:10:49 FORGE ollama[1460601]: #0  0x000079124f0ea42f in __GI___wait4 (pid=1460601, stat_loc=0x7ffcb70a19b4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:10:49 FORGE ollama[1460601]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 13:10:49 FORGE ollama[1460601]: #1  0x000079124f83ca88 in ggml_abort () from /tmp/ollama2682843892/runners/cuda_v12/libggml.so
Sep 04 13:10:49 FORGE ollama[1460601]: #2  0x000079124f909d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama2682843892/runners/cuda_v12/libggml.so
Sep 04 13:10:49 FORGE ollama[1460601]: #3  0x000079124f916c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama2682843892/runners/cuda_v12/libggml.so
Sep 04 13:10:49 FORGE ollama[1460601]: #4  0x00007912b38c3469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama2682843892/runners/cuda_v12/libllama.so
Sep 04 13:10:49 FORGE ollama[1460601]: #5  0x00007912b3904fe2 in llama_load_model_from_file () from /tmp/ollama2682843892/runners/cuda_v12/libllama.so
Sep 04 13:10:49 FORGE ollama[1460601]: #6  0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 13:10:49 FORGE ollama[1460601]: #7  0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 13:10:49 FORGE ollama[1460601]: #8  0x0000000000423058 in main ()
Sep 04 13:10:49 FORGE ollama[1460601]: [Inferior 1 (process 1460566) detached]
Sep 04 13:10:51 FORGE ollama[841963]: time=2024-09-04T13:10:51.079-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
Sep 04 13:10:51 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:10:51 | 500 |  3.297901778s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:10:51 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:51.079-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:51 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:51.330-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:51 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:51.581-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:51 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:51.830-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:52 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:52.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:52 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:52.331-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:52 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:52.581-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:52 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:52.831-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:53 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:53.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:53 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:53.331-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:53 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:53.581-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:53 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:53.830-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:54 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:54.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:54 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:54.331-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:54 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:54.580-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:54 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:54.830-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:55 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:55.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:55 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:55.331-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:55 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:55.581-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:55 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:55.831-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:56 FORGE ollama[841963]: time=2024-09-04T13:10:56.080-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000983837 model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 04 13:10:56 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:56.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:56 FORGE ollama[841963]: time=2024-09-04T13:10:56.329-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2507956 model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 04 13:10:56 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:56.330-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:56 FORGE ollama[841963]: time=2024-09-04T13:10:56.580-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501179435 model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 04 13:11:11 FORGE systemd[1]: Stopping Ollama Service...
Sep 04 13:11:11 FORGE systemd[1]: ollama.service: Deactivated successfully.
Sep 04 13:11:11 FORGE systemd[1]: Stopped Ollama Service.
Sep 04 13:11:11 FORGE systemd[1]: ollama.service: Consumed 2min 50.718s CPU time.
Sep 04 13:11:11 FORGE systemd[1]: Started Ollama Service.
Sep 04 13:11:11 FORGE ollama[1461072]: 2024/09/04 13:11:11 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
Sep 04 13:11:11 FORGE ollama[1461072]: time=2024-09-04T13:11:11.592-04:00 level=INFO source=images.go:753 msg="total blobs: 268"
Sep 04 13:11:11 FORGE ollama[1461072]: time=2024-09-04T13:11:11.596-04:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 04 13:11:11 FORGE ollama[1461072]: time=2024-09-04T13:11:11.598-04:00 level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.9)"
Sep 04 13:11:11 FORGE ollama[1461072]: time=2024-09-04T13:11:11.599-04:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama4167472154/runners
Sep 04 13:11:16 FORGE ollama[1461072]: time=2024-09-04T13:11:16.366-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
Sep 04 13:11:16 FORGE ollama[1461072]: time=2024-09-04T13:11:16.366-04:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
Sep 04 13:11:16 FORGE ollama[1461072]: time=2024-09-04T13:11:16.447-04:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda variant=v12 compute=8.9 driver=12.2 name="NVIDIA GeForce RTX 4090" total="23.6 GiB" available="21.9 GiB"
Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.100-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23934861312 required="6.2 GiB"
Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.100-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[22.3 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.101-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 46523"
Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.101-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.101-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.101-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:11:56 FORGE ollama[1461917]: INFO [main] build info | build=1 commit="1e6f655" tid="124675729735680" timestamp=1725469916
Sep 04 13:11:56 FORGE ollama[1461917]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="124675729735680" timestamp=1725469916 total_threads=32
Sep 04 13:11:56 FORGE ollama[1461917]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="46523" tid="124675729735680" timestamp=1725469916
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   5:                         general.size_label str              = 8B
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv   9:                          llama.block_count u32              = 32
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  17:                          general.file_type u32              = 2
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - type  f32:   66 tensors
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  225 tensors
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 256
Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.352-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = BPE
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 128256
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 280147
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 32
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 8
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 4
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 14336
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 500000.0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: model type       = 8B
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.03 B
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 128 'Ä'
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: max token length = 256
Sep 04 13:11:56 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:11:56 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:11:56 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:11:56 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: ggml ctx size =    0.27 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors:        CPU buffer size =   281.81 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors:      CUDA0 buffer size =  4156.00 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: n_ctx      = 8192
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: n_batch    = 512
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: n_ubatch   = 512
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: flash_attn = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: freq_base  = 500000.0
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: freq_scale = 1
Sep 04 13:11:56 FORGE ollama[1461072]: llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: graph nodes  = 1030
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: graph splits = 2
Sep 04 13:11:57 FORGE ollama[1461917]: INFO [main] model loaded | tid="124675729735680" timestamp=1725469917
Sep 04 13:11:57 FORGE ollama[1461072]: time=2024-09-04T13:11:57.106-04:00 level=INFO source=server.go:630 msg="llama runner started in 1.00 seconds"
Sep 04 13:11:58 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:11:58 | 200 |  2.038118055s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:12:18 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:12:18 | 200 |  1.538377657s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:13:41 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:13:41 | 200 |  2.979718661s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:14:23 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:14:23 | 200 |  2.088595512s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:15:19 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:15:19 | 200 |  1.206651689s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:15:57 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:15:57 | 200 |  812.749244ms |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:16:20 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:16:20 | 200 |  686.756108ms |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:16:34 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:16:34 | 200 |  1.551377078s |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:17:01 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:17:01 | 200 |  914.295957ms |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:17:34 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:17:34 | 200 |  766.851879ms |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:17:44 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:17:44 | 200 |  287.497181ms |       127.0.0.1 | POST     "/api/generate"
Sep 04 13:18:01 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:01.919-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.934-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.934-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.935-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.937-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 41377"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.938-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.938-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.938-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:18:01 FORGE ollama[1470788]: INFO [main] build info | build=1 commit="1e6f655" tid="124231231954944" timestamp=1725470281
Sep 04 13:18:01 FORGE ollama[1470788]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="124231231954944" timestamp=1725470281 total_threads=32
Sep 04 13:18:01 FORGE ollama[1470788]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="41377" tid="124231231954944" timestamp=1725470281
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   2:                               general.name str              = Yi Coder 9B Chat
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   3:                           general.finetune str              = Chat
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   4:                           general.basename str              = Yi-Coder
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   5:                         general.size_label str              = 9B
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   7:                          llama.block_count u32              = 48
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 11008
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 4
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000000.000000
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  15:                          general.file_type u32              = 2
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  16:                           llama.vocab_size u32              = 64000
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  18:            tokenizer.ggml.add_space_prefix bool             = false
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,64000]   = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,64000]   = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  30:               general.quantization_version u32              = 2
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - type  f32:   97 tensors
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = SPM
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 64000
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 48
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 4
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 8
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 11008
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: model type       = 34B
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.83 B
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 13:18:02 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:18:02 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:18:02 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:18:02 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:18:02 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:18:02 FORGE ollama[1461072]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:18:02 FORGE ollama[1461072]:   cudaMemGetInfo(free, total)
Sep 04 13:18:02 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470789]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470790]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470791]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470792]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470793]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470794]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470795]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470796]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470797]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470798]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470799]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470800]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470801]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470802]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470803]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470804]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470805]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470806]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470807]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470808]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470809]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470810]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470811]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470812]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470813]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470814]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470815]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470816]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470817]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470818]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470819]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470820]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470821]
Sep 04 13:18:02 FORGE ollama[1470823]: [Thread debugging using libthread_db enabled]
Sep 04 13:18:02 FORGE ollama[1470823]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:18:02 FORGE ollama[1470823]: 0x000070fc720ea42f in __GI___wait4 (pid=1470823, stat_loc=0x7ffd33cf1d44, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:02 FORGE ollama[1461072]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:18:02 FORGE ollama[1470823]: #0  0x000070fc720ea42f in __GI___wait4 (pid=1470823, stat_loc=0x7ffd33cf1d44, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:02 FORGE ollama[1470823]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 13:18:02 FORGE ollama[1470823]: #1  0x000070fc7283ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:02 FORGE ollama[1470823]: #2  0x000070fc72909d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:02 FORGE ollama[1470823]: #3  0x000070fc72916c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:02 FORGE ollama[1470823]: #4  0x000070fcd6804469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:02 FORGE ollama[1470823]: #5  0x000070fcd6845fe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:02 FORGE ollama[1470823]: #6  0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 13:18:02 FORGE ollama[1470823]: #7  0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 13:18:02 FORGE ollama[1470823]: #8  0x0000000000423058 in main ()
Sep 04 13:18:02 FORGE ollama[1461072]: time=2024-09-04T13:18:02.246-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:18:02 FORGE ollama[1470823]: [Inferior 1 (process 1470788) detached]
Sep 04 13:18:02 FORGE ollama[1461072]: time=2024-09-04T13:18:02.696-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 13:18:03 FORGE ollama[1461072]: time=2024-09-04T13:18:03.850-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable\n  current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040\n  cudaMemGetInfo(free, total)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error"
Sep 04 13:18:03 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:03 | 500 |  1.943146661s |       127.0.0.1 | POST     "/api/chat"
Sep 04 13:18:03 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:03.851-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:04.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:04.352-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:04.603-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:04.852-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:05 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:05.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:05 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:05.353-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:05 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:05.602-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:05 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:05.852-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:06.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:06.352-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:06.603-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:06.852-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.352-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:07 | 200 |      11.352µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:18:07 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:07 | 200 |    8.246808ms |       127.0.0.1 | POST     "/api/show"
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.602-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.607-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.624-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.625-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.625-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.626-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 40973"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.627-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.627-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.627-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:18:07 FORGE ollama[1470980]: INFO [main] build info | build=1 commit="1e6f655" tid="124851219718144" timestamp=1725470287
Sep 04 13:18:07 FORGE ollama[1470980]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="124851219718144" timestamp=1725470287 total_threads=32
Sep 04 13:18:07 FORGE ollama[1470980]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="40973" tid="124851219718144" timestamp=1725470287
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   2:                               general.name str              = Yi Coder 9B Chat
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   3:                           general.finetune str              = Chat
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   4:                           general.basename str              = Yi-Coder
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   5:                         general.size_label str              = 9B
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   7:                          llama.block_count u32              = 48
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 11008
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 4
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000000.000000
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  15:                          general.file_type u32              = 2
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  16:                           llama.vocab_size u32              = 64000
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  18:            tokenizer.ggml.add_space_prefix bool             = false
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,64000]   = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,64000]   = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  30:               general.quantization_version u32              = 2
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - type  f32:   97 tensors
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = SPM
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 64000
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 48
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 4
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 8
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 11008
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: model type       = 34B
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.83 B
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 13:18:07 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:18:07 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:18:07 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:18:07 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:18:07 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:18:07 FORGE ollama[1461072]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:18:07 FORGE ollama[1461072]:   cudaMemGetInfo(free, total)
Sep 04 13:18:07 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470981]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470982]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470983]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470984]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470985]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470986]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470987]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470988]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470989]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470990]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470991]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470992]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470993]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470994]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470995]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470996]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470997]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470998]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470999]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471000]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471001]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471002]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471003]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471004]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471005]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471006]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471007]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471008]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471009]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471010]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471011]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471012]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471013]
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.851-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1471015]: [Thread debugging using libthread_db enabled]
Sep 04 13:18:07 FORGE ollama[1471015]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:18:07 FORGE ollama[1471015]: 0x0000718ccc2ea42f in __GI___wait4 (pid=1471015, stat_loc=0x7ffcad718ff4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:07 FORGE ollama[1461072]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:18:07 FORGE ollama[1471015]: #0  0x0000718ccc2ea42f in __GI___wait4 (pid=1471015, stat_loc=0x7ffcad718ff4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:07 FORGE ollama[1471015]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 13:18:07 FORGE ollama[1471015]: #1  0x0000718ccca3ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:07 FORGE ollama[1471015]: #2  0x0000718cccb09d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:07 FORGE ollama[1471015]: #3  0x0000718cccb16c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:07 FORGE ollama[1471015]: #4  0x0000718d30b53469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:07 FORGE ollama[1471015]: #5  0x0000718d30b94fe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:07 FORGE ollama[1471015]: #6  0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 13:18:07 FORGE ollama[1471015]: #7  0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 13:18:07 FORGE ollama[1471015]: #8  0x0000000000423058 in main ()
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.910-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:18:07 FORGE ollama[1471015]: [Inferior 1 (process 1470980) detached]
Sep 04 13:18:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:08.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:08.352-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:08 FORGE ollama[1461072]: time=2024-09-04T13:18:08.361-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 13:18:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:08.602-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:08 FORGE ollama[1461072]: time=2024-09-04T13:18:08.851-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.00123842 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:08.852-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:09 FORGE ollama[1461072]: time=2024-09-04T13:18:09.102-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251756147 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:09.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:09 FORGE ollama[1461072]: time=2024-09-04T13:18:09.351-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501123479 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:09 FORGE ollama[1461072]: time=2024-09-04T13:18:09.514-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
Sep 04 13:18:09 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:09 | 500 |  1.923404002s |       127.0.0.1 | POST     "/api/chat"
Sep 04 13:18:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:09.515-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:09.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:10.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:10.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:10.515-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:10.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:11.015-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:11.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:11.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:11.765-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:12 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:12.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:12 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:12.265-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:12 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:12.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:12 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:12.765-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.265-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:13 | 200 |      16.829µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:18:13 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:13 | 200 |     9.61044ms |       127.0.0.1 | POST     "/api/show"
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.640-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.650-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.650-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.650-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.651-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 43553"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.651-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.651-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.651-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:18:13 FORGE ollama[1471193]: INFO [main] build info | build=1 commit="1e6f655" tid="138366229303296" timestamp=1725470293
Sep 04 13:18:13 FORGE ollama[1471193]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="138366229303296" timestamp=1725470293 total_threads=32
Sep 04 13:18:13 FORGE ollama[1471193]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="43553" tid="138366229303296" timestamp=1725470293
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   2:                               general.name str              = Yi Coder 9B Chat
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   3:                           general.finetune str              = Chat
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   4:                           general.basename str              = Yi-Coder
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   5:                         general.size_label str              = 9B
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   7:                          llama.block_count u32              = 48
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 11008
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 4
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000000.000000
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  15:                          general.file_type u32              = 2
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  16:                           llama.vocab_size u32              = 64000
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  18:            tokenizer.ggml.add_space_prefix bool             = false
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,64000]   = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,64000]   = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv  30:               general.quantization_version u32              = 2
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - type  f32:   97 tensors
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = SPM
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 64000
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 48
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 4
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 8
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 11008
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: model type       = 34B
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.83 B
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 13:18:13 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:18:13 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:18:13 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:18:13 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:18:13 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:18:13 FORGE ollama[1461072]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:18:13 FORGE ollama[1461072]:   cudaMemGetInfo(free, total)
Sep 04 13:18:13 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471194]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471195]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471196]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471197]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471198]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471199]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471200]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471201]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471202]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471203]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471204]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471205]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471206]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471207]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471208]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471209]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471210]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471211]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471212]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471213]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471214]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471215]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471216]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471217]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471218]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471219]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471220]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471221]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471222]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471223]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471224]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471225]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471226]
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.765-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1471228]: [Thread debugging using libthread_db enabled]
Sep 04 13:18:13 FORGE ollama[1471228]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:18:13 FORGE ollama[1471228]: 0x00007dd7818ea42f in __GI___wait4 (pid=1471228, stat_loc=0x7ffe2625b094, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:13 FORGE ollama[1461072]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:18:13 FORGE ollama[1471228]: #0  0x00007dd7818ea42f in __GI___wait4 (pid=1471228, stat_loc=0x7ffe2625b094, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:13 FORGE ollama[1471228]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 13:18:13 FORGE ollama[1471228]: #1  0x00007dd78203ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:13 FORGE ollama[1471228]: #2  0x00007dd782109d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:13 FORGE ollama[1471228]: #3  0x00007dd782116c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:13 FORGE ollama[1471228]: #4  0x00007dd7e60da469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:13 FORGE ollama[1471228]: #5  0x00007dd7e611bfe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:13 FORGE ollama[1471228]: #6  0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 13:18:13 FORGE ollama[1471228]: #7  0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 13:18:13 FORGE ollama[1471228]: #8  0x0000000000423058 in main ()
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.910-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:18:13 FORGE ollama[1471228]: [Inferior 1 (process 1471193) detached]
Sep 04 13:18:14 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:14.015-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:14 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:14.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:14 FORGE ollama[1461072]: time=2024-09-04T13:18:14.361-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 13:18:14 FORGE ollama[1461072]: time=2024-09-04T13:18:14.515-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001345171 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:14 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:14.515-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:14 FORGE ollama[1461072]: time=2024-09-04T13:18:14.765-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251224478 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:14 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:14.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:15 FORGE ollama[1461072]: time=2024-09-04T13:18:15.015-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501346069 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:15 FORGE ollama[1461072]: time=2024-09-04T13:18:15.514-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
Sep 04 13:18:15 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:15 | 500 |  1.889397041s |       127.0.0.1 | POST     "/api/chat"
Sep 04 13:18:15 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:15.515-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:15 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:15.767-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:16 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:16.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:16 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:16.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:16 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:16.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:16 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:16.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:17 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:17.017-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:17 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:17.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:17 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:17.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:17 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:17.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:18 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:18.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:18 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:18.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:18 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:18.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:18 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:18.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:19.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:19.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:19.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:19.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:20 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:20.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:20 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:20.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:20 FORGE ollama[1461072]: time=2024-09-04T13:18:20.516-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001506631 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:20 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:20.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:20 FORGE ollama[1461072]: time=2024-09-04T13:18:20.765-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251204197 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:20 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:20.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:21 FORGE ollama[1461072]: time=2024-09-04T13:18:21.015-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500751415 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:38 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:38 | 200 |      26.809µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:18:38 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:38 | 200 |   16.192728ms |       127.0.0.1 | POST     "/api/show"
Sep 04 13:18:38 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:38 | 200 |    9.152746ms |       127.0.0.1 | POST     "/api/chat"
Sep 04 13:18:41 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:41 | 200 |  279.957401ms |       127.0.0.1 | POST     "/api/chat"
Sep 04 13:21:01 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:21:01 | 200 |      47.526µs |       127.0.0.1 | GET      "/api/version"
Sep 04 13:22:14 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:22:14 | 200 |      48.756µs |       127.0.0.1 | HEAD     "/"
Sep 04 13:22:14 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:22:14 | 200 |   15.951301ms |       127.0.0.1 | POST     "/api/show"
Sep 04 13:22:14 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:14.678-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.694-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.695-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.695-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.696-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 44533"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.696-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.696-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.696-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:22:14 FORGE ollama[1475532]: INFO [main] build info | build=1 commit="1e6f655" tid="137424205864960" timestamp=1725470534
Sep 04 13:22:14 FORGE ollama[1475532]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="137424205864960" timestamp=1725470534 total_threads=32
Sep 04 13:22:14 FORGE ollama[1475532]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="44533" tid="137424205864960" timestamp=1725470534
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   2:                               general.name str              = Yi Coder 9B Chat
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   3:                           general.finetune str              = Chat
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   4:                           general.basename str              = Yi-Coder
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   5:                         general.size_label str              = 9B
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   7:                          llama.block_count u32              = 48
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 11008
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 4
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000000.000000
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  15:                          general.file_type u32              = 2
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  16:                           llama.vocab_size u32              = 64000
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  18:            tokenizer.ggml.add_space_prefix bool             = false
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,64000]   = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,64000]   = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv  30:               general.quantization_version u32              = 2
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - type  f32:   97 tensors
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = SPM
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 64000
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 48
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 4
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 8
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 11008
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: model type       = 34B
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.83 B
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 13:22:14 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:22:14 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:22:14 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:22:14 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:22:14 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:22:14 FORGE ollama[1461072]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:22:14 FORGE ollama[1461072]:   cudaMemGetInfo(free, total)
Sep 04 13:22:14 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475533]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475534]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475535]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475536]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475537]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475538]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475539]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475540]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475541]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475542]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475543]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475544]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475545]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475546]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475547]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475548]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475549]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475550]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475551]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475552]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475553]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475554]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475555]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475556]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475557]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475558]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475559]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475560]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475561]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475562]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475563]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475564]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475565]
Sep 04 13:22:14 FORGE ollama[1475567]: [Thread debugging using libthread_db enabled]
Sep 04 13:22:14 FORGE ollama[1475567]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:22:14 FORGE ollama[1475567]: 0x00007cfc2c8ea42f in __GI___wait4 (pid=1475567, stat_loc=0x7ffc97e171d4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:22:14 FORGE ollama[1461072]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:22:14 FORGE ollama[1475567]: #0  0x00007cfc2c8ea42f in __GI___wait4 (pid=1475567, stat_loc=0x7ffc97e171d4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:22:14 FORGE ollama[1475567]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 13:22:14 FORGE ollama[1475567]: #1  0x00007cfc2d03ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:22:14 FORGE ollama[1475567]: #2  0x00007cfc2d109d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:22:14 FORGE ollama[1475567]: #3  0x00007cfc2d116c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:22:14 FORGE ollama[1475567]: #4  0x00007cfc9114b469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:22:14 FORGE ollama[1475567]: #5  0x00007cfc9118cfe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:22:14 FORGE ollama[1475567]: #6  0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 13:22:14 FORGE ollama[1475567]: #7  0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 13:22:14 FORGE ollama[1475567]: #8  0x0000000000423058 in main ()
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.964-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:22:14 FORGE ollama[1475567]: [Inferior 1 (process 1475532) detached]
Sep 04 13:22:15 FORGE ollama[1461072]: time=2024-09-04T13:22:15.415-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 13:22:16 FORGE ollama[1461072]: time=2024-09-04T13:22:16.568-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
Sep 04 13:22:16 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:22:16 | 500 |  1.908285461s |       127.0.0.1 | POST     "/api/chat"
Sep 04 13:22:16 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:16.569-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:16 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:16.820-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:17 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:17.069-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:17 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:17.320-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:17 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:17.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:17 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:17.820-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:18 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:18.070-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:18 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:18.320-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:18 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:18.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:18 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:18.819-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:19.070-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:19.319-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:19.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:19.820-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:20 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:20.069-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:20 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:20.320-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:20 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:20.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:20 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:20.820-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:21 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:21.070-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:21 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:21.320-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:21 FORGE ollama[1461072]: time=2024-09-04T13:22:21.569-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000850923 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:22:21 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:21.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:21 FORGE ollama[1461072]: time=2024-09-04T13:22:21.819-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250506103 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:22:21 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:22:21.819-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:22 FORGE ollama[1461072]: time=2024-09-04T13:22:22.069-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501120636 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:23:41 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:23:41.664-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:24:55 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:24:55 | 200 |      38.066µs |       127.0.0.1 | HEAD     "/"
Sep 04 14:24:55 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:24:55 | 200 |   22.757806ms |       127.0.0.1 | POST     "/api/show"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.988-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23714660352 required="6.2 GiB"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.988-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[22.1 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.989-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 43667"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.990-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.990-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.990-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 14:24:56 FORGE ollama[1540742]: INFO [main] build info | build=1 commit="1e6f655" tid="126153710018560" timestamp=1725474296
Sep 04 14:24:56 FORGE ollama[1540742]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="126153710018560" timestamp=1725474296 total_threads=32
Sep 04 14:24:56 FORGE ollama[1540742]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="43667" tid="126153710018560" timestamp=1725474296
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   5:                         general.size_label str              = 8B
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv   9:                          llama.block_count u32              = 32
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  17:                          general.file_type u32              = 2
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - type  f32:   66 tensors
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  225 tensors
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 256
Sep 04 14:24:56 FORGE ollama[1461072]: time=2024-09-04T14:24:56.241-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = BPE
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 128256
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 280147
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 32
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 8
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 4
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 14336
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 500000.0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: model type       = 8B
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.03 B
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 128 'Ä'
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: max token length = 256
Sep 04 14:24:56 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 14:24:56 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 14:24:56 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 14:24:56 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: ggml ctx size =    0.27 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors:        CPU buffer size =   281.81 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors:      CUDA0 buffer size =  4156.00 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: n_ctx      = 8192
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: n_batch    = 512
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: n_ubatch   = 512
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: flash_attn = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: freq_base  = 500000.0
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: freq_scale = 1
Sep 04 14:24:56 FORGE ollama[1461072]: llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: graph nodes  = 1030
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: graph splits = 2
Sep 04 14:24:56 FORGE ollama[1540742]: INFO [main] model loaded | tid="126153710018560" timestamp=1725474296
Sep 04 14:24:56 FORGE ollama[1461072]: time=2024-09-04T14:24:56.995-04:00 level=INFO source=server.go:630 msg="llama runner started in 1.01 seconds"
Sep 04 14:24:56 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:24:56 | 200 |  1.107376461s |       127.0.0.1 | POST     "/api/chat"
Sep 04 14:25:19 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:25:19 | 200 |  3.176770441s |       127.0.0.1 | POST     "/api/chat"
chris@FORGE:~/bin$ 


...
<!-- gh-comment-id:2329720126 -->
@iplayfast commented on GitHub (Sep 4, 2024):

Here you go, hope it helps

```
journalctl -u ollama --no-pager --since="2024-09-04 12:00"
Sep 04 12:52:45 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:52:45 | 200 | 16.69µs | 127.0.0.1 | HEAD "/"
Sep 04 12:52:45 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:52:45 | 404 | 47.783µs | 127.0.0.1 | POST "/api/show"
Sep 04 12:52:46 FORGE ollama[841963]: time=2024-09-04T12:52:46.988-04:00 level=INFO source=download.go:175 msg="downloading 8169bd33ad13 in 16 314 MB part(s)"
Sep 04 12:53:53 FORGE ollama[841963]: time=2024-09-04T12:53:53.935-04:00 level=INFO source=download.go:175 msg="downloading a23e6bd35e94 in 1 693 B part(s)"
Sep 04 12:53:55 FORGE ollama[841963]: time=2024-09-04T12:53:55.833-04:00 level=INFO source=download.go:175 msg="downloading 3dc12ee097e8 in 1 135 B part(s)"
Sep 04 12:53:57 FORGE ollama[841963]: time=2024-09-04T12:53:57.755-04:00 level=INFO source=download.go:175 msg="downloading a60ed831ae4c in 1 485 B part(s)"
Sep 04 12:54:01 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:54:01 | 200 | 1m15s | 127.0.0.1 | POST "/api/pull"
Sep 04 12:54:01 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:54:01 | 200 | 25.060116ms | 127.0.0.1 | POST "/api/show"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.308-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23707648000 required="6.4 GiB"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.308-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[22.1 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.309-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama2682843892/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 41089"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.310-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.310-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.310-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 12:54:01 FORGE ollama[1441911]: INFO [main] build info | build=1 commit="1e6f655" tid="126918209675264" timestamp=1725468841
Sep 04 12:54:01 FORGE ollama[1441911]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="126918209675264" timestamp=1725468841 total_threads=32
Sep 04 12:54:01 FORGE ollama[1441911]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="41089" tid="126918209675264" timestamp=1725468841
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 1: general.type str = model
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 2: general.name str = Yi Coder 9B Chat
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 3: general.finetune str = Chat
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 4: general.basename str = Yi-Coder
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 5: general.size_label str = 9B
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 6: general.license str = apache-2.0
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 7: llama.block_count u32 = 48
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 8: llama.context_length u32 = 131072
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 9: llama.embedding_length u32 = 4096
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 10: llama.feed_forward_length u32 = 11008
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 11: llama.attention.head_count u32 = 32
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 4
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 13: llama.rope.freq_base f32 = 10000000.000000
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 15: general.file_type u32 = 2
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 16: llama.vocab_size u32 = 64000
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 128
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 18: tokenizer.ggml.add_space_prefix bool = false
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,64000] = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,64000] = [-1000.000000, -1000.000000, -1000.00...
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,64000] = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 1
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 2
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 0
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 28: tokenizer.ggml.add_eos_token bool = false
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 29: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - kv 30: general.quantization_version u32 = 2
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - type f32: 97 tensors
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - type q4_0: 337 tensors
Sep 04 12:54:01 FORGE ollama[841963]: llama_model_loader: - type q6_K: 1 tensors
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_vocab: special tokens cache size = 12
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: arch = llama
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: vocab type = SPM
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_vocab = 64000
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_merges = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: vocab_only = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_ctx_train = 131072
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd = 4096
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_layer = 48
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_head = 32
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_head_kv = 4
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_rot = 128
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_swa = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_k = 128
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_v = 128
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_gqa = 8
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd_k_gqa = 512
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_embd_v_gqa = 512
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_ff = 11008
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_expert = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_expert_used = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: causal attn = 1
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: pooling type = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: rope type = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: rope scaling = linear
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: freq_base_train = 10000000.0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: freq_scale_train = 1
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: rope_finetuned = unknown
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: ssm_d_conv = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: ssm_d_inner = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: ssm_d_state = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: ssm_dt_rank = 0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: model type = 34B
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: model ftype = Q4_0
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: model params = 8.83 B
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: model size = 4.69 GiB (4.56 BPW)
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: general.name = Yi Coder 9B Chat
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: BOS token = 1 '<|startoftext|>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: EOS token = 2 '<|endoftext|>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: UNK token = 0 '<unk>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: PAD token = 0 '<unk>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: LF token = 315 '<0x0A>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: EOT token = 2 '<|endoftext|>'
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_print_meta: max token length = 48
Sep 04 12:54:01 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 04 12:54:01 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 12:54:01 FORGE ollama[841963]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 12:54:01 FORGE ollama[841963]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: ggml ctx size = 0.41 MiB
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: offloading 48 repeating layers to GPU
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: offloaded 49/49 layers to GPU
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: CPU buffer size = 140.62 MiB
Sep 04 12:54:01 FORGE ollama[841963]: llm_load_tensors: CUDA0 buffer size = 4661.61 MiB
Sep 04 12:54:01 FORGE ollama[841963]: time=2024-09-04T12:54:01.561-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: n_ctx = 8192
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: n_batch = 512
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: n_ubatch = 512
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: flash_attn = 0
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: freq_base = 10000000.0
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: freq_scale = 1
Sep 04 12:54:02 FORGE ollama[841963]: llama_kv_cache_init: CUDA0 KV buffer size = 768.00 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: CUDA_Host output buffer size = 1.04 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: graph nodes = 1542
Sep 04 12:54:02 FORGE ollama[841963]: llama_new_context_with_model: graph splits = 2
Sep 04 12:54:02 FORGE ollama[1441911]: INFO [main] model loaded | tid="126918209675264" timestamp=1725468842
Sep 04 12:54:02 FORGE ollama[841963]: time=2024-09-04T12:54:02.314-04:00 level=INFO source=server.go:630 msg="llama runner started in 1.00 seconds"
Sep 04 12:54:02 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:54:02 | 200 | 1.09955161s | 127.0.0.1 | POST "/api/chat"
Sep 04 12:56:25 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:56:25 | 200 | 2.112406341s | 127.0.0.1 | POST "/api/chat"
Sep 04 12:58:05 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:58:05 | 200 | 22.902µs | 127.0.0.1 | HEAD "/"
Sep 04 12:58:05 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:58:05 | 200 | 10.305179ms | 127.0.0.1 | GET "/api/tags"
Sep 04 12:59:29 FORGE ollama[841963]: [GIN] 2024/09/04 - 12:59:29 | 200 | 2.165997807s | 127.0.0.1 | POST "/api/chat"
Sep 04 13:00:44 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:00:44 | 200 | 31.832µs | 127.0.0.1 | HEAD "/"
Sep 04 13:00:44 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:00:44 | 200 | 16.798501ms | 127.0.0.1 | GET "/api/tags"
Sep 04 13:01:33 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:01:33 | 200 | 48.597µs | 127.0.0.1 | HEAD "/"
Sep 04 13:01:33 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:01:33 | 200 | 14.299022ms | 127.0.0.1 | GET "/api/tags"
Sep 04 13:02:11 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:11 | 200 | 23.888µs | 127.0.0.1 | HEAD "/"
Sep 04 13:02:11 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:11 | 200 | 124.734063ms | 127.0.0.1 | DELETE "/api/delete"
Sep 04 13:02:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:18 | 200 | 56.422µs | 127.0.0.1 | HEAD "/"
Sep 04 13:02:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:18 | 200 | 111.964713ms | 127.0.0.1 | DELETE "/api/delete"
Sep 04 13:02:31 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:31 | 200 | 68.827µs | 127.0.0.1 | HEAD "/"
Sep 04 13:02:32 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:32 | 200 | 241.180652ms | 127.0.0.1 | DELETE "/api/delete"
Sep 04 13:02:34 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:34 | 200 | 16.769µs | 127.0.0.1 | HEAD "/"
Sep 04 13:02:34 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:34 | 200 | 8.65959ms | 127.0.0.1 | GET "/api/tags"
Sep 04 13:02:54 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:54 | 200 | 20.896µs | 127.0.0.1 | HEAD "/"
Sep 04 13:02:54 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:02:54 | 400 | 41.243µs | 127.0.0.1 | DELETE "/api/delete"
Sep 04 13:03:09 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:03:09 | 200 | 40.24µs | 127.0.0.1 | HEAD "/"
Sep 04 13:03:09 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:03:09 | 500 | 133.843µs | 127.0.0.1 | DELETE "/api/delete"
Sep 04 13:03:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:03:18 | 200 | 35.656µs | 127.0.0.1 | HEAD "/"
Sep 04 13:03:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:03:18 | 200 | 81.959669ms | 127.0.0.1 | DELETE "/api/delete"
Sep 04 13:04:29 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:04:29.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:05:07 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:07 | 200 | 27.692µs | 127.0.0.1 | HEAD "/"
Sep 04 13:05:07 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:07 | 200 | 7.418544ms | 127.0.0.1 | POST "/api/show"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.994-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23613407232 required="6.4 GiB"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.995-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[22.0 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.997-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama2682843892/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 44427"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.997-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.997-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:05:07 FORGE ollama[841963]: time=2024-09-04T13:05:07.997-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:05:08 FORGE ollama[1453465]: INFO [main] build info | build=1 commit="1e6f655" tid="132648159997952" timestamp=1725469508
Sep 04 13:05:08 FORGE ollama[1453465]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="132648159997952" timestamp=1725469508
Sep 04 13:05:08 FORGE ollama[1453465]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="44427" tid="132648159997952" timestamp=1725469508
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 1: general.type str = model
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 2: general.name str = Yi Coder 9B Chat
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 3: general.finetune str = Chat
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 4: general.basename str = Yi-Coder
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 5: general.size_label str = 9B
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 6: general.license str = apache-2.0
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 7: llama.block_count u32 = 48
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 8: llama.context_length u32 = 131072
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 9: llama.embedding_length u32 = 4096
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 10: llama.feed_forward_length u32 = 11008
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 11: llama.attention.head_count u32 = 32
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 4
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 13: llama.rope.freq_base f32 = 10000000.000000
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 15: general.file_type u32 = 2
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 16: llama.vocab_size u32 = 64000
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 128
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 18: tokenizer.ggml.add_space_prefix bool = false
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,64000] = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,64000] = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,64000] = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 1
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 2
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 0
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 28: tokenizer.ggml.add_eos_token bool = false
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 29: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - kv 30: general.quantization_version u32 = 2
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - type f32: 97 tensors
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - type q4_0: 337 tensors
Sep 04 13:05:08 FORGE ollama[841963]: llama_model_loader: - type q6_K: 1 tensors
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: arch = llama
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: vocab type = SPM
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_vocab = 64000
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_merges = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: vocab_only = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_ctx_train = 131072
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd = 4096
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_layer = 48
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_head = 32
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_head_kv = 4
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_rot = 128
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_swa = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_k = 128
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_v = 128
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_gqa = 8
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd_k_gqa = 512
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_embd_v_gqa = 512
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_ff = 11008
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_expert = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_expert_used = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: causal attn = 1
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: pooling type = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: rope type = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: rope scaling = linear
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: freq_base_train = 10000000.0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: rope_finetuned = unknown
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: ssm_d_conv = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: ssm_d_inner = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: ssm_d_state = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: ssm_dt_rank = 0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: model type = 34B
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: model ftype = Q4_0
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: model params = 8.83 B
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: model size = 4.69 GiB (4.56 BPW)
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: general.name = Yi Coder 9B Chat
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: BOS token = 1 '<|startoftext|>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: EOS token = 2 '<|endoftext|>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: UNK token = 0 '<unk>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: PAD token = 0 '<unk>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: LF token = 315 '<0x0A>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: EOT token = 2 '<|endoftext|>'
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_print_meta: max token length = 48
Sep 04 13:05:08 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 04 13:05:08 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:05:08 FORGE ollama[841963]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:05:08 FORGE ollama[841963]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: ggml ctx size = 0.41 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: offloading 48 repeating layers to GPU
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: offloaded 49/49 layers to GPU
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: CPU buffer size = 140.62 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llm_load_tensors: CUDA0 buffer size = 4661.61 MiB
Sep 04 13:05:08 FORGE ollama[841963]: time=2024-09-04T13:05:08.249-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: n_ctx = 8192
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: n_batch = 512
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: n_ubatch = 512
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: flash_attn = 0
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: freq_base = 10000000.0
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: freq_scale = 1
Sep 04 13:05:08 FORGE ollama[841963]: llama_kv_cache_init: CUDA0 KV buffer size = 768.00 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: CUDA_Host output buffer size = 1.04 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: graph nodes = 1542
Sep 04 13:05:08 FORGE ollama[841963]: llama_new_context_with_model: graph splits = 2
Sep 04 13:05:08 FORGE ollama[1453465]: INFO [main] model loaded | tid="132648159997952" timestamp=1725469508
Sep 04 13:05:09 FORGE ollama[841963]: time=2024-09-04T13:05:09.001-04:00 level=INFO source=server.go:630 msg="llama runner started in 1.00 seconds"
Sep 04 13:05:09 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:09 | 200 | 1.105447205s | 127.0.0.1 | POST "/api/chat"
Sep 04 13:05:18 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:18 | 200 | 20.929748ms | 127.0.0.1 | POST "/api/show"
Sep 04 13:05:23 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:23 | 200 | 8.70556ms | 127.0.0.1 | POST "/api/show"
Sep 04 13:05:30 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:05:30 | 200 | 15.351273ms | 127.0.0.1 | POST "/api/show"
Sep 04 13:06:27 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:06:27 | 200 | 291.413881ms | 127.0.0.1 | POST "/api/generate"
Sep 04 13:06:28 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:06:28 | 200 | 1.014343573s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:06:53 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:06:53 | 200 | 1.369648167s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:06:54 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:06:54 | 200 | 1.052921831s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:09:48 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:09:48 | 200 | 2.274160768s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:09:50 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:09:50 | 200 | 1.399936035s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:09:51 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:09:51 | 200 | 1.407673273s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:09:53 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:09:53 | 200 | 1.444955083s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:10:47 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:47.793-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.817-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.2 GiB"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.817-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18496550912 required="6.2 GiB"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.817-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[17.2 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.818-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama2682843892/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 32919"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.818-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.818-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:10:47 FORGE ollama[841963]: time=2024-09-04T13:10:47.818-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:10:47 FORGE ollama[1460566]: INFO [main] build info | build=1 commit="1e6f655" tid="133121226891264" timestamp=1725469847
Sep 04 13:10:47 FORGE ollama[1460566]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="133121226891264" timestamp=1725469847 total_threads=32
Sep 04 13:10:47 FORGE ollama[1460566]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="32919" tid="133121226891264" timestamp=1725469847
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 1: general.type str = model
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 5: general.size_label str = 8B
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 9: llama.block_count u32 = 32 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 10: llama.context_length u32 = 131072 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 11: llama.embedding_length u32 = 4096 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 17: general.file_type u32 = 2 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - type f32: 66 tensors Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - type q4_0: 225 tensors Sep 04 13:10:47 FORGE ollama[841963]: llama_model_loader: - type q6_K: 1 tensors Sep 04 13:10:48 FORGE ollama[841963]: time=2024-09-04T13:10:48.069-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" Sep 04 13:10:48 FORGE ollama[841963]: llm_load_vocab: special tokens cache size = 256 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_vocab: token to piece cache size = 0.7999 MB Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: format = GGUF V3 (latest) Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: arch = llama Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: vocab type = BPE Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_vocab = 128256 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_merges = 280147 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: vocab_only = 0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_ctx_train = 131072 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd = 4096 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_layer = 32 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_head = 32 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_head_kv = 8 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_rot = 128 Sep 04 13:10:48 FORGE 
ollama[841963]: llm_load_print_meta: n_swa = 0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_k = 128 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd_head_v = 128 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_gqa = 4 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd_k_gqa = 1024 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_embd_v_gqa = 1024 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_norm_eps = 0.0e+00 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: f_logit_scale = 0.0e+00 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_ff = 14336 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_expert = 0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_expert_used = 0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: causal attn = 1 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: pooling type = 0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: rope type = 0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: rope scaling = linear Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: freq_base_train = 500000.0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: freq_scale_train = 1 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: n_ctx_orig_yarn = 131072 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: rope_finetuned = unknown Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: ssm_d_conv = 0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: ssm_d_inner = 0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: ssm_d_state = 0 Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: 
ssm_dt_rank = 0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: model type = 8B
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: model ftype = Q4_0
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: model params = 8.03 B
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: LF token = 128 'Ä'
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 04 13:10:48 FORGE ollama[841963]: llm_load_print_meta: max token length = 256
Sep 04 13:10:48 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 04 13:10:48 FORGE ollama[841963]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:10:48 FORGE ollama[841963]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:10:48 FORGE ollama[841963]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:10:48 FORGE ollama[841963]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:10:48 FORGE ollama[841963]: current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:10:48 FORGE ollama[841963]: cudaMemGetInfo(free, total)
Sep 04 13:10:48 FORGE ollama[841963]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460567]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460568]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460569]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460570]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460571]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460572]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460573]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460574]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460575]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460576]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460577]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460578]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460579]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460580]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460581]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460582]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460583]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460584]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460585]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460586]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460587]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460588]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460589]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460590]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460591]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460592]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460593]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460594]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460595]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460596]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460597]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460598]
Sep 04 13:10:48 FORGE ollama[1460601]: [New LWP 1460599]
Sep 04 13:10:49 FORGE ollama[841963]: time=2024-09-04T13:10:49.023-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 13:10:49 FORGE ollama[1460601]: [Thread debugging using libthread_db enabled]
Sep 04 13:10:49 FORGE ollama[1460601]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:10:49 FORGE ollama[1460601]: 0x000079124f0ea42f in __GI___wait4 (pid=1460601, stat_loc=0x7ffcb70a19b4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:10:49 FORGE ollama[841963]: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:10:49 FORGE ollama[1460601]: #0 0x000079124f0ea42f in __GI___wait4 (pid=1460601, stat_loc=0x7ffcb70a19b4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:10:49 FORGE ollama[1460601]: 30 in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 13:10:49 FORGE ollama[1460601]: #1 0x000079124f83ca88 in ggml_abort () from /tmp/ollama2682843892/runners/cuda_v12/libggml.so
Sep 04 13:10:49 FORGE ollama[1460601]: #2 0x000079124f909d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama2682843892/runners/cuda_v12/libggml.so
Sep 04 13:10:49 FORGE ollama[1460601]: #3 0x000079124f916c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama2682843892/runners/cuda_v12/libggml.so
Sep 04 13:10:49 FORGE ollama[1460601]: #4 0x00007912b38c3469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama2682843892/runners/cuda_v12/libllama.so
Sep 04 13:10:49 FORGE ollama[1460601]: #5 0x00007912b3904fe2 in llama_load_model_from_file () from /tmp/ollama2682843892/runners/cuda_v12/libllama.so
Sep 04 13:10:49 FORGE ollama[1460601]: #6 0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 13:10:49 FORGE ollama[1460601]: #7 0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 13:10:49 FORGE ollama[1460601]: #8 0x0000000000423058 in main ()
Sep 04 13:10:49 FORGE ollama[1460601]: [Inferior 1 (process 1460566) detached]
Sep 04 13:10:51 FORGE ollama[841963]: time=2024-09-04T13:10:51.079-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
Sep 04 13:10:51 FORGE ollama[841963]: [GIN] 2024/09/04 - 13:10:51 | 500 | 3.297901778s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:10:51 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:51.079-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:51 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:51.330-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:51 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:51.581-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:51 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:51.830-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:52 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:52.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:52 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:52.331-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:52 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:52.581-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:52 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:52.831-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:53 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:53.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:53 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:53.331-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:53 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:53.581-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:53 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:53.830-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:54 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:54.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:54 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:54.331-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:54 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:54.580-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:54 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:54.830-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:55 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:55.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:55 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:55.331-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:55 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:55.581-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:55 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:55.831-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:56 FORGE ollama[841963]: time=2024-09-04T13:10:56.080-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000983837 model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 04 13:10:56 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:56.080-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:56 FORGE ollama[841963]: time=2024-09-04T13:10:56.329-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2507956 model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 04 13:10:56 FORGE ollama[841963]: cuda driver library failed to get device context 46time=2024-09-04T13:10:56.330-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:10:56 FORGE ollama[841963]: time=2024-09-04T13:10:56.580-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501179435 model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 04 13:11:11 FORGE systemd[1]: Stopping Ollama Service...
Sep 04 13:11:11 FORGE systemd[1]: ollama.service: Deactivated successfully.
Sep 04 13:11:11 FORGE systemd[1]: Stopped Ollama Service.
Sep 04 13:11:11 FORGE systemd[1]: ollama.service: Consumed 2min 50.718s CPU time.
Sep 04 13:11:11 FORGE systemd[1]: Started Ollama Service.
Sep 04 13:11:11 FORGE ollama[1461072]: 2024/09/04 13:11:11 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" Sep 04 13:11:11 FORGE ollama[1461072]: time=2024-09-04T13:11:11.592-04:00 level=INFO source=images.go:753 msg="total blobs: 268" Sep 04 13:11:11 FORGE ollama[1461072]: time=2024-09-04T13:11:11.596-04:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" Sep 04 13:11:11 FORGE ollama[1461072]: time=2024-09-04T13:11:11.598-04:00 level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.9)" Sep 04 13:11:11 FORGE ollama[1461072]: time=2024-09-04T13:11:11.599-04:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama4167472154/runners Sep 04 13:11:16 FORGE ollama[1461072]: time=2024-09-04T13:11:16.366-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]" Sep 04 13:11:16 FORGE ollama[1461072]: time=2024-09-04T13:11:16.366-04:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs" Sep 04 13:11:16 FORGE ollama[1461072]: time=2024-09-04T13:11:16.447-04:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda variant=v12 compute=8.9 driver=12.2 name="NVIDIA 
GeForce RTX 4090" total="23.6 GiB" available="21.9 GiB" Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.100-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23934861312 required="6.2 GiB" Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.100-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[22.3 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB" Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.101-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 46523" Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.101-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.101-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding" Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.101-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error" Sep 04 13:11:56 FORGE ollama[1461917]: INFO [main] build info | build=1 commit="1e6f655" tid="124675729735680" timestamp=1725469916 Sep 04 
13:11:56 FORGE ollama[1461917]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="124675729735680" timestamp=1725469916 total_threads=32 Sep 04 13:11:56 FORGE ollama[1461917]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="46523" tid="124675729735680" timestamp=1725469916 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest)) Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 0: general.architecture str = llama Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 1: general.type str = model Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 3: general.finetune str = Instruct Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 5: general.size_label str = 8B Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 6: general.license str = llama3.1 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... 
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 9: llama.block_count u32 = 32 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 10: llama.context_length u32 = 131072 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 11: llama.embedding_length u32 = 4096 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 17: general.file_type u32 = 2 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - type f32: 66 tensors Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - type q4_0: 225 tensors Sep 04 13:11:56 FORGE ollama[1461072]: llama_model_loader: - type q6_K: 1 tensors Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 256 Sep 04 13:11:56 FORGE ollama[1461072]: time=2024-09-04T13:11:56.352-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.7999 MB Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: format = GGUF V3 (latest) Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: arch = llama Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: vocab type = BPE Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_vocab = 128256 Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_merges = 280147 Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: vocab_only = 0 Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train = 131072 Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd = 4096 Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_layer = 32 Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_head = 32 Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv = 8 Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_rot = 128 Sep 04 
13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_swa             = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 4
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 14336
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 500000.0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: model type       = 8B
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.03 B
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 128 'Ä'
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_print_meta: max token length = 256
Sep 04 13:11:56 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:11:56 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:11:56 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:11:56 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: ggml ctx size = 0.27 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: CPU buffer size = 281.81 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llm_load_tensors: CUDA0 buffer size = 4156.00 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: n_ctx      = 8192
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: n_batch    = 512
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: n_ubatch   = 512
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: flash_attn = 0
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: freq_base  = 500000.0
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: freq_scale = 1
Sep 04 13:11:56 FORGE ollama[1461072]: llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: graph nodes  = 1030
Sep 04 13:11:56 FORGE ollama[1461072]: llama_new_context_with_model: graph splits = 2
Sep 04 13:11:57 FORGE ollama[1461917]: INFO [main] model loaded | tid="124675729735680" timestamp=1725469917
Sep 04 13:11:57 FORGE ollama[1461072]: time=2024-09-04T13:11:57.106-04:00 level=INFO source=server.go:630 msg="llama runner started in 1.00 seconds"
Sep 04 13:11:58 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:11:58 | 200 | 2.038118055s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:12:18 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:12:18 | 200 | 1.538377657s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:13:41 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:13:41 | 200 | 2.979718661s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:14:23 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:14:23 | 200 | 2.088595512s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:15:19 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:15:19 | 200 | 1.206651689s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:15:57 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:15:57 | 200 | 812.749244ms | 127.0.0.1 | POST "/api/generate"
Sep 04 13:16:20 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:16:20 | 200 | 686.756108ms | 127.0.0.1 | POST "/api/generate"
Sep 04 13:16:34 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:16:34 | 200 | 1.551377078s | 127.0.0.1 | POST "/api/generate"
Sep 04 13:17:01 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:17:01 | 200 | 914.295957ms | 127.0.0.1 | POST "/api/generate"
Sep 04 13:17:34 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:17:34 | 200 | 766.851879ms | 127.0.0.1 | POST "/api/generate"
Sep 04 13:17:44 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:17:44 | 200 | 287.497181ms | 127.0.0.1 | POST "/api/generate"
Sep 04 13:18:01 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:01.919-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.934-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.934-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.935-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.937-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 41377"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.938-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.938-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:18:01 FORGE ollama[1461072]: time=2024-09-04T13:18:01.938-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:18:01 FORGE ollama[1470788]: INFO [main] build info | build=1 commit="1e6f655" tid="124231231954944" timestamp=1725470281
Sep 04 13:18:01 FORGE ollama[1470788]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="124231231954944" timestamp=1725470281 total_threads=32
Sep 04 13:18:01 FORGE ollama[1470788]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="41377" tid="124231231954944" timestamp=1725470281
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   0: general.architecture str = llama
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   1: general.type str = model
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   2: general.name str = Yi Coder 9B Chat
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   3: general.finetune str = Chat
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   4: general.basename str = Yi-Coder
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   5: general.size_label str = 9B
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   6: general.license str = apache-2.0
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   7: llama.block_count u32 = 48
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   8: llama.context_length u32 = 131072
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv   9: llama.embedding_length u32 = 4096
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  10: llama.feed_forward_length u32 = 11008
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  11: llama.attention.head_count u32 = 32
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  12: llama.attention.head_count_kv u32 = 4
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  13: llama.rope.freq_base f32 = 10000000.000000
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  15: general.file_type u32 = 2
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  16: llama.vocab_size u32 = 64000
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  17: llama.rope.dimension_count u32 = 128
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  18: tokenizer.ggml.add_space_prefix bool = false
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  19: tokenizer.ggml.model str = llama
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  20: tokenizer.ggml.pre str = default
Sep 04 13:18:01 FORGE ollama[1461072]: llama_model_loader: - kv  21: tokenizer.ggml.tokens arr[str,64000] = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  22: tokenizer.ggml.scores arr[f32,64000] = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  23: tokenizer.ggml.token_type arr[i32,64000] = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  24: tokenizer.ggml.bos_token_id u32 = 1
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  25: tokenizer.ggml.eos_token_id u32 = 2
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  26: tokenizer.ggml.padding_token_id u32 = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  27: tokenizer.ggml.add_bos_token bool = false
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  28: tokenizer.ggml.add_eos_token bool = false
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  29: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - kv  30: general.quantization_version u32 = 2
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - type  f32:   97 tensors
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 13:18:02 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = SPM
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 64000
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 48
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 4
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 8
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 11008
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: model type       = 34B
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.83 B
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 13:18:02 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 13:18:02 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:18:02 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:18:02 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:18:02 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:18:02 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:18:02 FORGE ollama[1461072]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:18:02 FORGE ollama[1461072]:   cudaMemGetInfo(free, total)
Sep 04 13:18:02 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470789]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470790]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470791]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470792]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470793]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470794]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470795]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470796]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470797]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470798]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470799]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470800]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470801]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470802]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470803]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470804]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470805]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470806]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470807]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470808]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470809]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470810]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470811]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470812]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470813]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470814]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470815]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470816]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470817]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470818]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470819]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470820]
Sep 04 13:18:02 FORGE ollama[1470823]: [New LWP 1470821]
Sep 04 13:18:02 FORGE ollama[1470823]: [Thread debugging using libthread_db enabled]
Sep 04 13:18:02 FORGE ollama[1470823]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:18:02 FORGE ollama[1470823]: 0x000070fc720ea42f in __GI___wait4 (pid=1470823, stat_loc=0x7ffd33cf1d44, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:02 FORGE ollama[1461072]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:18:02 FORGE ollama[1470823]: #0  0x000070fc720ea42f in __GI___wait4 (pid=1470823, stat_loc=0x7ffd33cf1d44, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:02 FORGE ollama[1470823]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 13:18:02 FORGE ollama[1470823]: #1  0x000070fc7283ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:02 FORGE ollama[1470823]: #2  0x000070fc72909d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:02 FORGE ollama[1470823]: #3  0x000070fc72916c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:02 FORGE ollama[1470823]: #4  0x000070fcd6804469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:02 FORGE ollama[1470823]: #5  0x000070fcd6845fe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:02 FORGE ollama[1470823]: #6  0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 13:18:02 FORGE ollama[1470823]: #7  0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 13:18:02 FORGE ollama[1470823]: #8  0x0000000000423058 in main ()
Sep 04 13:18:02 FORGE ollama[1461072]: time=2024-09-04T13:18:02.246-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:18:02 FORGE ollama[1470823]: [Inferior 1 (process 1470788) detached]
Sep 04 13:18:02 FORGE ollama[1461072]: time=2024-09-04T13:18:02.696-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 13:18:03 FORGE ollama[1461072]: time=2024-09-04T13:18:03.850-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable\n  current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040\n  cudaMemGetInfo(free, total)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error"
Sep 04 13:18:03 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:03 | 500 | 1.943146661s | 127.0.0.1 | POST "/api/chat"
Sep 04 13:18:03 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:03.851-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:04.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:04.352-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:04.603-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:04.852-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:05 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:05.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:05 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:05.353-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:05 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:05.602-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:05 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:05.852-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:06.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:06.352-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:06.603-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:06.852-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.352-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:07 | 200 | 11.352µs | 127.0.0.1 | HEAD "/"
Sep 04 13:18:07 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:07 | 200 | 8.246808ms | 127.0.0.1 | POST "/api/show"
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.602-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.607-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.624-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.625-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.625-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.626-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 40973"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.627-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.627-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.627-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:18:07 FORGE ollama[1470980]: INFO [main] build info | build=1 commit="1e6f655" tid="124851219718144" timestamp=1725470287
Sep 04 13:18:07 FORGE ollama[1470980]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="124851219718144" timestamp=1725470287 total_threads=32
Sep 04 13:18:07 FORGE ollama[1470980]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="40973" tid="124851219718144" timestamp=1725470287
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   0: general.architecture str = llama
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   1: general.type str = model
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   2: general.name str = Yi Coder 9B Chat
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   3: general.finetune str = Chat
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   4: general.basename str = Yi-Coder
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   5: general.size_label str = 9B
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   6: general.license str = apache-2.0
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   7: llama.block_count u32 = 48
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   8: llama.context_length u32 = 131072
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv   9: llama.embedding_length u32 = 4096
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  10: llama.feed_forward_length u32 = 11008
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  11: llama.attention.head_count u32 = 32
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  12: llama.attention.head_count_kv u32 = 4
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  13: llama.rope.freq_base f32 = 10000000.000000
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  15: general.file_type u32 = 2
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  16: llama.vocab_size u32 = 64000
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  17: llama.rope.dimension_count u32 = 128
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  18: tokenizer.ggml.add_space_prefix bool = false
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  19: tokenizer.ggml.model str = llama
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  20: tokenizer.ggml.pre str = default
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  21: tokenizer.ggml.tokens arr[str,64000] = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  22: tokenizer.ggml.scores arr[f32,64000] = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  23: tokenizer.ggml.token_type arr[i32,64000] = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  24: tokenizer.ggml.bos_token_id u32 = 1
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  25: tokenizer.ggml.eos_token_id u32 = 2
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  26: tokenizer.ggml.padding_token_id u32 = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  27: tokenizer.ggml.add_bos_token bool = false
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  28: tokenizer.ggml.add_eos_token bool = false
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  29: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - kv  30: general.quantization_version u32 = 2
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - type  f32:   97 tensors
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 13:18:07 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = SPM
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 64000
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 48
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 4
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 8
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 11008
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: model type       = 34B
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.83 B
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 13:18:07 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 13:18:07 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 13:18:07 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:18:07 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:18:07 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:18:07 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:18:07 FORGE ollama[1461072]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:18:07 FORGE ollama[1461072]:   cudaMemGetInfo(free, total)
Sep 04 13:18:07 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470981]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470982]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470983]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470984]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470985]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470986]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470987]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470988]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470989]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470990]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470991]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470992]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470993]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470994]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470995]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470996]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470997]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470998]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1470999]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471000]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471001]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471002]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471003]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471004]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471005]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471006]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471007]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471008]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471009]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471010]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471011]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471012]
Sep 04 13:18:07 FORGE ollama[1471015]: [New LWP 1471013]
Sep 04 13:18:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:07.851-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:07 FORGE ollama[1471015]: [Thread debugging using libthread_db enabled]
Sep 04 13:18:07 FORGE ollama[1471015]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:18:07 FORGE ollama[1471015]: 0x0000718ccc2ea42f in __GI___wait4 (pid=1471015, stat_loc=0x7ffcad718ff4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:07 FORGE ollama[1461072]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:18:07 FORGE ollama[1471015]: #0 0x0000718ccc2ea42f in __GI___wait4 (pid=1471015, stat_loc=0x7ffcad718ff4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:07 FORGE ollama[1471015]: 30 in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 13:18:07 FORGE ollama[1471015]: #1 0x0000718ccca3ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:07 FORGE ollama[1471015]: #2 0x0000718cccb09d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:07 FORGE ollama[1471015]: #3 0x0000718cccb16c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:18:07 FORGE ollama[1471015]: #4 0x0000718d30b53469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:07 FORGE ollama[1471015]: #5 0x0000718d30b94fe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:18:07 FORGE ollama[1471015]: #6 0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 13:18:07 FORGE ollama[1471015]: #7 0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 13:18:07 FORGE ollama[1471015]: #8 0x0000000000423058 in main ()
Sep 04 13:18:07 FORGE ollama[1461072]: time=2024-09-04T13:18:07.910-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:18:07 FORGE ollama[1471015]: [Inferior 1 (process 1470980) detached]
Sep 04 13:18:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:08.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:08.352-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:08 FORGE ollama[1461072]: time=2024-09-04T13:18:08.361-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 13:18:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:08.602-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:08 FORGE ollama[1461072]: time=2024-09-04T13:18:08.851-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.00123842 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:08.852-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:09 FORGE ollama[1461072]: time=2024-09-04T13:18:09.102-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251756147 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:09.102-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:09 FORGE ollama[1461072]: time=2024-09-04T13:18:09.351-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501123479 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:09 FORGE ollama[1461072]: time=2024-09-04T13:18:09.514-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
Sep 04 13:18:09 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:09 | 500 | 1.923404002s | 127.0.0.1 | POST "/api/chat"
Sep 04 13:18:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:09.515-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:09.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:10.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:10.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:10.515-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:10.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:11.015-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:11.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:11.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:11.765-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:12 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:12.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:12 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:12.265-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:12 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:12.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:12 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:12.765-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.265-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:13 | 200 | 16.829µs | 127.0.0.1 | HEAD "/"
Sep 04 13:18:13 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:13 | 200 | 9.61044ms | 127.0.0.1 | POST "/api/show"
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.640-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.650-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.650-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.650-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.651-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 43553"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.651-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.651-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.651-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:18:13 FORGE ollama[1471193]: INFO [main] build info | build=1 commit="1e6f655" tid="138366229303296" timestamp=1725470293
Sep 04 13:18:13 FORGE ollama[1471193]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="138366229303296" timestamp=1725470293 total_threads=32
Sep 04 13:18:13 FORGE ollama[1471193]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="43553" tid="138366229303296" timestamp=1725470293
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 1: general.type str = model
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 2: general.name str = Yi Coder 9B Chat
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 3: general.finetune str = Chat
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 4: general.basename str = Yi-Coder
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 5: general.size_label str = 9B
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 6: general.license str = apache-2.0
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 7: llama.block_count u32 = 48
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 8: llama.context_length u32 = 131072
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 9: llama.embedding_length u32 = 4096
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 10: llama.feed_forward_length u32 = 11008
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 11: llama.attention.head_count u32 = 32
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 4
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 13: llama.rope.freq_base f32 = 10000000.000000
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 15: general.file_type u32 = 2
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 16: llama.vocab_size u32 = 64000
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 128
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 18: tokenizer.ggml.add_space_prefix bool = false
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,64000] = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,64000] = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,64000] = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 1
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 2
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 28: tokenizer.ggml.add_eos_token bool = false
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 29: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - kv 30: general.quantization_version u32 = 2
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - type f32: 97 tensors
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - type q4_0: 337 tensors
Sep 04 13:18:13 FORGE ollama[1461072]: llama_model_loader: - type q6_K: 1 tensors
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: arch = llama
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: vocab type = SPM
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_vocab = 64000
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_merges = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: vocab_only = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train = 131072
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd = 4096
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_layer = 48
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_head = 32
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv = 4
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_rot = 128
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_swa = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k = 128
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v = 128
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_gqa = 8
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa = 512
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa = 512
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_ff = 11008
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_expert = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: causal attn = 1
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: pooling type = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: rope type = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: rope scaling = linear
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train = 10000000.0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned = unknown
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank = 0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: model type = 34B
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: model ftype = Q4_0
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: model params = 8.83 B
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: model size = 4.69 GiB (4.56 BPW)
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: general.name = Yi Coder 9B Chat
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: BOS token = 1 '<|startoftext|>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: EOS token = 2 '<|endoftext|>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: UNK token = 0 '<unk>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: PAD token = 0 '<unk>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: LF token = 315 '<0x0A>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: EOT token = 2 '<|endoftext|>'
Sep 04 13:18:13 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 13:18:13 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 04 13:18:13 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:18:13 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:18:13 FORGE ollama[1461072]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:18:13 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:18:13 FORGE ollama[1461072]: current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:18:13 FORGE ollama[1461072]: cudaMemGetInfo(free, total)
Sep 04 13:18:13 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471194]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471195]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471196]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471197]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471198]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471199]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471200]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471201]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471202]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471203]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471204]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471205]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471206]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471207]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471208]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471209]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471210]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471211]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471212]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471213]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471214]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471215]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471216]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471217]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471218]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471219]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471220]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471221]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471222]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471223]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471224]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471225]
Sep 04 13:18:13 FORGE ollama[1471228]: [New LWP 1471226]
Sep 04 13:18:13 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:13.765-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:13 FORGE ollama[1471228]: [Thread debugging using libthread_db enabled]
Sep 04 13:18:13 FORGE ollama[1471228]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:18:13 FORGE ollama[1471228]: 0x00007dd7818ea42f in __GI___wait4 (pid=1471228, stat_loc=0x7ffe2625b094, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:18:13 FORGE ollama[1461072]: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:18:13 FORGE ollama[1471228]: #0 0x00007dd7818ea42f in __GI___wait4 (pid=1471228, stat_loc=0x7ffe2625b094, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30 Sep 04 13:18:13 FORGE ollama[1471228]: 30 in ../sysdeps/unix/sysv/linux/wait4.c Sep 04 13:18:13 FORGE ollama[1471228]: #1 0x00007dd78203ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so Sep 04 13:18:13 FORGE ollama[1471228]: #2 0x00007dd782109d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so Sep 04 13:18:13 FORGE ollama[1471228]: #3 0x00007dd782116c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so Sep 04 13:18:13 FORGE ollama[1471228]: #4 0x00007dd7e60da469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so Sep 04 13:18:13 FORGE ollama[1471228]: #5 0x00007dd7e611bfe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so Sep 04 13:18:13 FORGE ollama[1471228]: #6 0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) () Sep 04 13:18:13 FORGE ollama[1471228]: #7 0x0000000000473710 in llama_server_context::load_model(gpt_params const&) () Sep 04 13:18:13 FORGE ollama[1471228]: #8 0x0000000000423058 in main () Sep 04 13:18:13 FORGE ollama[1461072]: time=2024-09-04T13:18:13.910-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" Sep 04 13:18:13 FORGE ollama[1471228]: [Inferior 1 (process 1471193) detached] Sep 04 13:18:14 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:14.015-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 13:18:14 FORGE ollama[1461072]: cuda driver library failed to get device context 
46time=2024-09-04T13:18:14.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 13:18:14 FORGE ollama[1461072]: time=2024-09-04T13:18:14.361-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding" Sep 04 13:18:14 FORGE ollama[1461072]: time=2024-09-04T13:18:14.515-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001345171 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 Sep 04 13:18:14 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:14.515-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 13:18:14 FORGE ollama[1461072]: time=2024-09-04T13:18:14.765-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251224478 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 Sep 04 13:18:14 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:14.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 13:18:15 FORGE ollama[1461072]: time=2024-09-04T13:18:15.015-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501346069 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 Sep 04 13:18:15 FORGE ollama[1461072]: time=2024-09-04T13:18:15.514-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error" Sep 04 13:18:15 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:15 | 500 | 1.889397041s | 127.0.0.1 | POST "/api/chat" Sep 04 13:18:15 FORGE ollama[1461072]: cuda driver library failed to get device context 
time=2024-09-04T13:18:15.515-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:15 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:15.767-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:16 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:16.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:16 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:16.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:16 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:16.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:16 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:16.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:17 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:17.017-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:17 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:17.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:17 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:17.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:17 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:17.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:18 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:18.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:18 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:18.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:18 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:18.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:18 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:18.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:19 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:19.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:19 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:19.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:19 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:19.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:19 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:19.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:20 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:20.016-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:20 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:20.266-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:20 FORGE ollama[1461072]: time=2024-09-04T13:18:20.516-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001506631 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:20 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:20.516-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:20 FORGE ollama[1461072]: time=2024-09-04T13:18:20.765-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251204197 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:20 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:18:20.766-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:18:21 FORGE ollama[1461072]: time=2024-09-04T13:18:21.015-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500751415 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:18:38 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:38 | 200 | 26.809µs | 127.0.0.1 | HEAD "/"
Sep 04 13:18:38 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:38 | 200 | 16.192728ms | 127.0.0.1 | POST "/api/show"
Sep 04 13:18:38 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:38 | 200 | 9.152746ms | 127.0.0.1 | POST "/api/chat"
Sep 04 13:18:41 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:18:41 | 200 | 279.957401ms | 127.0.0.1 | POST "/api/chat"
Sep 04 13:21:01 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:21:01 | 200 | 47.526µs | 127.0.0.1 | GET "/api/version"
Sep 04 13:22:14 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:22:14 | 200 | 48.756µs | 127.0.0.1 | HEAD "/"
Sep 04 13:22:14 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:22:14 | 200 | 15.951301ms | 127.0.0.1 | POST "/api/show"
Sep 04 13:22:14 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:14.678-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.694-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.695-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.695-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.696-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 44533"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.696-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.696-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.696-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 13:22:14 FORGE ollama[1475532]: INFO [main] build info | build=1 commit="1e6f655" tid="137424205864960" timestamp=1725470534
Sep 04 13:22:14 FORGE ollama[1475532]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="137424205864960" timestamp=1725470534 total_threads=32
Sep 04 13:22:14 FORGE ollama[1475532]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="44533" tid="137424205864960" timestamp=1725470534
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 1: general.type str = model
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 2: general.name str = Yi Coder 9B Chat
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 3: general.finetune str = Chat
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 4: general.basename str = Yi-Coder
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 5: general.size_label str = 9B
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 6: general.license str = apache-2.0
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 7: llama.block_count u32 = 48
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 8: llama.context_length u32 = 131072
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 9: llama.embedding_length u32 = 4096
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 10: llama.feed_forward_length u32 = 11008
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 11: llama.attention.head_count u32 = 32
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 4
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 13: llama.rope.freq_base f32 = 10000000.000000
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 15: general.file_type u32 = 2
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 16: llama.vocab_size u32 = 64000
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 128
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 18: tokenizer.ggml.add_space_prefix bool = false
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,64000] = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,64000] = [-1000.000000, -1000.000000, -1000.00...
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,64000] = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 1
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 2
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 28: tokenizer.ggml.add_eos_token bool = false
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 29: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - kv 30: general.quantization_version u32 = 2
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - type f32: 97 tensors
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - type q4_0: 337 tensors
Sep 04 13:22:14 FORGE ollama[1461072]: llama_model_loader: - type q6_K: 1 tensors
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: arch = llama
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: vocab type = SPM
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_vocab = 64000
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_merges = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: vocab_only = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train = 131072
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd = 4096
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_layer = 48
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_head = 32
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv = 4
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_rot = 128
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_swa = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k = 128
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v = 128
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_gqa = 8
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa = 512
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa = 512
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_ff = 11008
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_expert = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: causal attn = 1
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: pooling type = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: rope type = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: rope scaling = linear
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train = 10000000.0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned = unknown
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank = 0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: model type = 34B
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: model ftype = Q4_0
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: model params = 8.83 B
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: model size = 4.69 GiB (4.56 BPW)
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: general.name = Yi Coder 9B Chat
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: BOS token = 1 '<|startoftext|>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: EOS token = 2 '<|endoftext|>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: UNK token = 0 '<unk>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: PAD token = 0 '<unk>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: LF token = 315 '<0x0A>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: EOT token = 2 '<|endoftext|>'
Sep 04 13:22:14 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 13:22:14 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 04 13:22:14 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 13:22:14 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 13:22:14 FORGE ollama[1461072]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 13:22:14 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 13:22:14 FORGE ollama[1461072]: current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 13:22:14 FORGE ollama[1461072]: cudaMemGetInfo(free, total)
Sep 04 13:22:14 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475533]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475534]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475535]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475536]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475537]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475538]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475539]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475540]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475541]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475542]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475543]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475544]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475545]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475546]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475547]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475548]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475549]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475550]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475551]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475552]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475553]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475554]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475555]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475556]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475557]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475558]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475559]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475560]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475561]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475562]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475563]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475564]
Sep 04 13:22:14 FORGE ollama[1475567]: [New LWP 1475565]
Sep 04 13:22:14 FORGE ollama[1475567]: [Thread debugging using libthread_db enabled]
Sep 04 13:22:14 FORGE ollama[1475567]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 13:22:14 FORGE ollama[1475567]: 0x00007cfc2c8ea42f in __GI___wait4 (pid=1475567, stat_loc=0x7ffc97e171d4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:22:14 FORGE ollama[1461072]: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 13:22:14 FORGE ollama[1475567]: #0 0x00007cfc2c8ea42f in __GI___wait4 (pid=1475567, stat_loc=0x7ffc97e171d4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 13:22:14 FORGE ollama[1475567]: 30 in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 13:22:14 FORGE ollama[1475567]: #1 0x00007cfc2d03ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:22:14 FORGE ollama[1475567]: #2 0x00007cfc2d109d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:22:14 FORGE ollama[1475567]: #3 0x00007cfc2d116c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 13:22:14 FORGE ollama[1475567]: #4 0x00007cfc9114b469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:22:14 FORGE ollama[1475567]: #5 0x00007cfc9118cfe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 13:22:14 FORGE ollama[1475567]: #6 0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 13:22:14 FORGE ollama[1475567]: #7 0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 13:22:14 FORGE ollama[1475567]: #8 0x0000000000423058 in main ()
Sep 04 13:22:14 FORGE ollama[1461072]: time=2024-09-04T13:22:14.964-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 13:22:14 FORGE ollama[1475567]: [Inferior 1 (process 1475532) detached]
Sep 04 13:22:15 FORGE ollama[1461072]: time=2024-09-04T13:22:15.415-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 13:22:16 FORGE ollama[1461072]: time=2024-09-04T13:22:16.568-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
Sep 04 13:22:16 FORGE ollama[1461072]: [GIN] 2024/09/04 - 13:22:16 | 500 | 1.908285461s | 127.0.0.1 | POST "/api/chat"
Sep 04 13:22:16 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:16.569-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:16 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:16.820-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:17 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:17.069-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:17 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:17.320-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:17 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:17.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:17 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:17.820-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:18 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:18.070-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:18 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:18.320-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:18 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:18.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:18 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:18.819-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:19 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:19.070-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:19 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:19.319-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:19 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:19.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:19 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:19.820-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:20 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:20.069-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:20 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:20.320-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:20 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:20.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:20 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:20.820-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:21 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:21.070-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:21 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:21.320-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:21 FORGE ollama[1461072]: time=2024-09-04T13:22:21.569-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000850923 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:22:21 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:21.570-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:21 FORGE ollama[1461072]: time=2024-09-04T13:22:21.819-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250506103 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:22:21 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:22:21.819-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 13:22:22 FORGE ollama[1461072]: time=2024-09-04T13:22:22.069-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501120636 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 13:23:41 FORGE ollama[1461072]: cuda driver library failed to get device context
time=2024-09-04T13:23:41.664-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:24:55 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:24:55 | 200 | 38.066µs | 127.0.0.1 | HEAD "/"
Sep 04 14:24:55 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:24:55 | 200 | 22.757806ms | 127.0.0.1 | POST "/api/show"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.988-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23714660352 required="6.2 GiB"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.988-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[22.1 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.989-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 43667"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.990-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.990-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 14:24:55 FORGE ollama[1461072]: time=2024-09-04T14:24:55.990-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 14:24:56 FORGE ollama[1540742]: INFO [main] build info | build=1 commit="1e6f655" tid="126153710018560" timestamp=1725474296
Sep 04 14:24:56 FORGE ollama[1540742]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="126153710018560" timestamp=1725474296 total_threads=32
Sep 04 14:24:56 FORGE ollama[1540742]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="43667" tid="126153710018560" timestamp=1725474296
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 1: general.type str = model
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 5: general.size_label str = 8B
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 9: llama.block_count u32 = 32
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 17: general.file_type u32 = 2
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - type f32: 66 tensors
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - type q4_0: 225 tensors
Sep 04 14:24:56 FORGE ollama[1461072]: llama_model_loader: - type q6_K: 1 tensors
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 256
Sep 04 14:24:56 FORGE ollama[1461072]: time=2024-09-04T14:24:56.241-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: arch = llama
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: vocab type = BPE
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_vocab = 128256
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_merges = 280147
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: vocab_only = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train = 131072
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd = 4096
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_layer = 32
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_head = 32
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv = 8
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_rot = 128
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_swa = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k = 128
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v = 128
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_gqa = 4
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_ff = 14336
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_expert = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: causal attn = 1
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: pooling type = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: rope type = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: rope scaling = linear
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train = 500000.0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned = unknown
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank = 0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: model type = 8B
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: model ftype = Q4_0
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: model params = 8.03 B
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: LF token = 128 'Ä'
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_print_meta: max token length = 256
Sep 04 14:24:56 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 04 14:24:56 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 14:24:56 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 14:24:56 FORGE ollama[1461072]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: ggml ctx size = 0.27 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: CPU buffer size = 281.81 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llm_load_tensors: CUDA0 buffer size = 4156.00 MiB
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: n_ctx = 8192
Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: n_batch = 512
Sep 04 14:24:56 FORGE
ollama[1461072]: llama_new_context_with_model: n_ubatch = 512 Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: flash_attn = 0 Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: freq_base = 500000.0 Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: freq_scale = 1 Sep 04 14:24:56 FORGE ollama[1461072]: llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: graph nodes = 1030 Sep 04 14:24:56 FORGE ollama[1461072]: llama_new_context_with_model: graph splits = 2 Sep 04 14:24:56 FORGE ollama[1540742]: INFO [main] model loaded | tid="126153710018560" timestamp=1725474296 Sep 04 14:24:56 FORGE ollama[1461072]: time=2024-09-04T14:24:56.995-04:00 level=INFO source=server.go:630 msg="llama runner started in 1.01 seconds" Sep 04 14:24:56 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:24:56 | 200 | 1.107376461s | 127.0.0.1 | POST "/api/chat" Sep 04 14:25:19 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:25:19 | 200 | 3.176770441s | 127.0.0.1 | POST "/api/chat" chris@FORGE:~/bin$ ...
Author
Owner

@iplayfast commented on GitHub (Sep 4, 2024):

This might narrow it down:

chris@FORGE:~/bin$ ollama run yi-coder
Error: llama runner process has terminated: CUDA error
chris@FORGE:~/bin$ ollama run yi-coder
Error: llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable
  current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
  cudaMemGetInfo(free, total)
/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
chris@FORGE:~/bin$ journalctl -u ollama --no-pager --since="2024-09-04 14:29"
Sep 04 14:29:49 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:29:49 | 200 |       34.18µs |       127.0.0.1 | HEAD     "/"
Sep 04 14:29:49 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:29:49 | 200 |    8.830032ms |       127.0.0.1 | POST     "/api/show"
Sep 04 14:29:49 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:49.944-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.959-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.959-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.959-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.960-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 40829"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.960-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.960-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.961-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 14:29:49 FORGE ollama[1545718]: INFO [main] build info | build=1 commit="1e6f655" tid="130539399966720" timestamp=1725474589
Sep 04 14:29:49 FORGE ollama[1545718]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="130539399966720" timestamp=1725474589 total_threads=32
Sep 04 14:29:49 FORGE ollama[1545718]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="40829" tid="130539399966720" timestamp=1725474589
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   2:                               general.name str              = Yi Coder 9B Chat
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   3:                           general.finetune str              = Chat
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   4:                           general.basename str              = Yi-Coder
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   5:                         general.size_label str              = 9B
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   7:                          llama.block_count u32              = 48
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 11008
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 4
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000000.000000
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  15:                          general.file_type u32              = 2
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  16:                           llama.vocab_size u32              = 64000
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  18:            tokenizer.ggml.add_space_prefix bool             = false
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,64000]   = [-1000.000000, -1000.000000, -1000.00...
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,64000]   = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  30:               general.quantization_version u32              = 2
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - type  f32:   97 tensors
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = SPM
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 64000
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 48
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 4
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 8
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 11008
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: model type       = 34B
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.83 B
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 14:29:50 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 14:29:50 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 14:29:50 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 14:29:50 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 14:29:50 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 14:29:50 FORGE ollama[1461072]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 14:29:50 FORGE ollama[1461072]:   cudaMemGetInfo(free, total)
Sep 04 14:29:50 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545719]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545720]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545721]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545722]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545723]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545724]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545725]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545726]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545727]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545728]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545729]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545730]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545731]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545732]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545733]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545734]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545735]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545736]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545737]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545738]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545739]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545740]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545741]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545742]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545743]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545744]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545745]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545746]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545747]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545748]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545749]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545750]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545751]
Sep 04 14:29:50 FORGE ollama[1545753]: [Thread debugging using libthread_db enabled]
Sep 04 14:29:50 FORGE ollama[1545753]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 14:29:50 FORGE ollama[1545753]: 0x000076b92e6ea42f in __GI___wait4 (pid=1545753, stat_loc=0x7ffecb4ca4d4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 14:29:50 FORGE ollama[1461072]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 14:29:50 FORGE ollama[1545753]: #0  0x000076b92e6ea42f in __GI___wait4 (pid=1545753, stat_loc=0x7ffecb4ca4d4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 14:29:50 FORGE ollama[1545753]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 14:29:50 FORGE ollama[1545753]: #1  0x000076b92ee3ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 14:29:50 FORGE ollama[1545753]: #2  0x000076b92ef09d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 14:29:50 FORGE ollama[1545753]: #3  0x000076b92ef16c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 14:29:50 FORGE ollama[1545753]: #4  0x000076b992d6b469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 14:29:50 FORGE ollama[1545753]: #5  0x000076b992dacfe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 14:29:50 FORGE ollama[1545753]: #6  0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 14:29:50 FORGE ollama[1545753]: #7  0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 14:29:50 FORGE ollama[1545753]: #8  0x0000000000423058 in main ()
Sep 04 14:29:50 FORGE ollama[1461072]: time=2024-09-04T14:29:50.239-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 14:29:50 FORGE ollama[1545753]: [Inferior 1 (process 1545718) detached]
Sep 04 14:29:50 FORGE ollama[1461072]: time=2024-09-04T14:29:50.690-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 14:29:51 FORGE ollama[1461072]: time=2024-09-04T14:29:51.844-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
Sep 04 14:29:51 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:29:51 | 500 |  1.910032889s |       127.0.0.1 | POST     "/api/chat"
Sep 04 14:29:51 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:51.844-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:52 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:52.096-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:52 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:52.345-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:52 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:52.595-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:52 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:52.846-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:53 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:53.096-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:53 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:53.347-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:53 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:53.595-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:53 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:53.846-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:54 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:54.096-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:54 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:54.345-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:54 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:54.595-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:54 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:54.845-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:55 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:55.095-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:55 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:55.345-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:55 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:55.595-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:55 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:55.845-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:56 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:56.096-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:56 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:56.346-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:56 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:56.596-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:56 FORGE ollama[1461072]: time=2024-09-04T14:29:56.845-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001243643 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 14:29:56 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:56.845-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:57 FORGE ollama[1461072]: time=2024-09-04T14:29:57.095-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251050005 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 14:29:57 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:57.095-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:57 FORGE ollama[1461072]: time=2024-09-04T14:29:57.345-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501103261 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 14:30:04 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:30:04 | 200 |      44.846µs |       127.0.0.1 | HEAD     "/"
Sep 04 14:30:04 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:30:04 | 200 |    9.045069ms |       127.0.0.1 | POST     "/api/show"
Sep 04 14:30:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:04.574-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.599-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.600-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.600-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.601-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 36959"
Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.601-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.601-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.602-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 14:30:04 FORGE ollama[1546104]: INFO [main] build info | build=1 commit="1e6f655" tid="138559970435072" timestamp=1725474604
Sep 04 14:30:04 FORGE ollama[1546104]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="138559970435072" timestamp=1725474604 total_threads=32
Sep 04 14:30:04 FORGE ollama[1546104]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="36959" tid="138559970435072" timestamp=1725474604
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   2:                               general.name str              = Yi Coder 9B Chat
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   3:                           general.finetune str              = Chat
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   4:                           general.basename str              = Yi-Coder
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   5:                         general.size_label str              = 9B
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   7:                          llama.block_count u32              = 48
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 11008
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 4
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000000.000000
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  15:                          general.file_type u32              = 2
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  16:                           llama.vocab_size u32              = 64000
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  18:            tokenizer.ggml.add_space_prefix bool             = false
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,64000]   = [-1000.000000, -1000.000000, -1000.00...
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,64000]   = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv  30:               general.quantization_version u32              = 2
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - type  f32:   97 tensors
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = SPM
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 64000
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 48
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 4
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 8
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 11008
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: model type       = 34B
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.83 B
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 14:30:04 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 14:30:04 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 14:30:04 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 14:30:04 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 14:30:04 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 14:30:04 FORGE ollama[1461072]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 14:30:04 FORGE ollama[1461072]:   cudaMemGetInfo(free, total)
Sep 04 14:30:04 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546105]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546106]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546107]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546108]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546109]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546110]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546111]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546112]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546113]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546114]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546115]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546116]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546117]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546118]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546119]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546120]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546121]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546122]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546123]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546124]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546125]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546126]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546127]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546128]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546129]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546130]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546131]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546132]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546133]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546134]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546135]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546136]
Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546137]
Sep 04 14:30:04 FORGE ollama[1546139]: [Thread debugging using libthread_db enabled]
Sep 04 14:30:04 FORGE ollama[1546139]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 14:30:04 FORGE ollama[1546139]: 0x00007e049daea42f in __GI___wait4 (pid=1546139, stat_loc=0x7fff12aef494, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 14:30:04 FORGE ollama[1461072]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 14:30:04 FORGE ollama[1546139]: #0  0x00007e049daea42f in __GI___wait4 (pid=1546139, stat_loc=0x7fff12aef494, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 14:30:04 FORGE ollama[1546139]: 30        in ../sysdeps/unix/sysv/linux/wait4.c
Sep 04 14:30:04 FORGE ollama[1546139]: #1  0x00007e049e23ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 14:30:04 FORGE ollama[1546139]: #2  0x00007e049e309d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 14:30:04 FORGE ollama[1546139]: #3  0x00007e049e316c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so
Sep 04 14:30:04 FORGE ollama[1546139]: #4  0x00007e0502192469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 14:30:04 FORGE ollama[1546139]: #5  0x00007e05021d3fe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so
Sep 04 14:30:04 FORGE ollama[1546139]: #6  0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) ()
Sep 04 14:30:04 FORGE ollama[1546139]: #7  0x0000000000473710 in llama_server_context::load_model(gpt_params const&) ()
Sep 04 14:30:04 FORGE ollama[1546139]: #8  0x0000000000423058 in main ()
Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.863-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 14:30:04 FORGE ollama[1546139]: [Inferior 1 (process 1546104) detached]
Sep 04 14:30:05 FORGE ollama[1461072]: time=2024-09-04T14:30:05.314-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 14:30:06 FORGE ollama[1461072]: time=2024-09-04T14:30:06.468-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable\n  current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040\n  cudaMemGetInfo(free, total)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error"
Sep 04 14:30:06 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:30:06 | 500 |  1.910298228s |       127.0.0.1 | POST     "/api/chat"
Sep 04 14:30:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:06.469-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:06.720-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:06.970-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:07.220-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:07.470-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:07.720-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:07.971-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:08.220-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:08.471-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:08.721-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:08.970-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:09.221-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:09.470-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:09.721-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:09.970-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:10.220-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:10.470-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:10.720-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:10.971-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:11.221-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:11 FORGE ollama[1461072]: time=2024-09-04T14:30:11.469-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000829377 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 14:30:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:11.470-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:11 FORGE ollama[1461072]: time=2024-09-04T14:30:11.719-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250832055 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 14:30:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:11.720-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:30:11 FORGE ollama[1461072]: time=2024-09-04T14:30:11.969-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500791551 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2
Sep 04 14:30:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:19.238-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
chris@FORGE:~/bin$ ollama run llama3.1
>>> Send a message (/? for help)


<!-- gh-comment-id:2329725339 -->
@iplayfast commented on GitHub (Sep 4, 2024):

This might narrow it down:

```
chris@FORGE:~/bin$ ollama run yi-coder
Error: llama runner process has terminated: CUDA error
chris@FORGE:~/bin$ ollama run yi-coder
Error: llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable
  current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
  cudaMemGetInfo(free, total)
/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
chris@FORGE:~/bin$ journalctl -u ollama --no-pager --since="2024-09-04 14:29"
Sep 04 14:29:49 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:29:49 | 200 |      34.18µs |       127.0.0.1 | HEAD     "/"
Sep 04 14:29:49 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:29:49 | 200 |   8.830032ms |       127.0.0.1 | POST     "/api/show"
Sep 04 14:29:49 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:49.944-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.959-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.959-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.959-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.960-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 40829"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.960-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.960-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 14:29:49 FORGE ollama[1461072]: time=2024-09-04T14:29:49.961-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 14:29:49 FORGE ollama[1545718]: INFO [main] build info | build=1 commit="1e6f655" tid="130539399966720" timestamp=1725474589
Sep 04 14:29:49 FORGE ollama[1545718]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="130539399966720" timestamp=1725474589 total_threads=32
Sep 04 14:29:49 FORGE ollama[1545718]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="40829" tid="130539399966720" timestamp=1725474589
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest))
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   2:                               general.name str              = Yi Coder 9B Chat
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   3:                           general.finetune str              = Chat
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   4:                           general.basename str              = Yi-Coder
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   5:                         general.size_label str              = 9B
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   6:                            general.license str              = apache-2.0
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   7:                          llama.block_count u32              = 48
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 11008
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 4
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 10000000.000000
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  15:                          general.file_type u32              = 2
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  16:                           llama.vocab_size u32              = 64000
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  18:            tokenizer.ggml.add_space_prefix bool             = false
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
Sep 04 14:29:49 FORGE ollama[1461072]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,64000]   = [-1000.000000, -1000.000000, -1000.00...
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,64000]   = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ...
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - kv  30:               general.quantization_version u32              = 2
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - type  f32:   97 tensors
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - type q4_0:  337 tensors
Sep 04 14:29:50 FORGE ollama[1461072]: llama_model_loader: - type q6_K:    1 tensors
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: arch             = llama
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: vocab type       = SPM
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_vocab          = 64000
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_merges         = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: vocab_only       = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd           = 4096
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_layer          = 48
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_head           = 32
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv        = 4
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_rot            = 128
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_swa            = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_gqa            = 8
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa     = 512
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa     = 512
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_ff             = 11008
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_expert         = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used    = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: causal attn      = 1
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: pooling type     = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: rope type        = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: rope scaling     = linear
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train  = 10000000.0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: model type       = 34B
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: model ftype      = Q4_0
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: model params     = 8.83 B
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: model size       = 4.69 GiB (4.56 BPW)
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: general.name     = Yi Coder 9B Chat
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: EOS token        = 2 '<|endoftext|>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: LF token         = 315 '<0x0A>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: EOT token        = 2 '<|endoftext|>'
Sep 04 14:29:50 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48
Sep 04 14:29:50 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 14:29:50 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 14:29:50 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices:
Sep 04 14:29:50 FORGE ollama[1461072]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Sep 04 14:29:50 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Sep 04 14:29:50 FORGE ollama[1461072]:   current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040
Sep 04 14:29:50 FORGE ollama[1461072]:   cudaMemGetInfo(free, total)
Sep 04 14:29:50 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545719]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545720]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545721]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545722]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545723]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545724]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545725]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545726]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545727]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545728]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545729]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545730]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545731]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545732]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545733]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545734]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545735]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545736]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545737]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545738]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545739]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545740]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545741]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545742]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545743]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545744]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545745]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545746]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545747]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545748]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545749]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545750]
Sep 04 14:29:50 FORGE ollama[1545753]: [New LWP 1545751]
Sep 04 14:29:50 FORGE ollama[1545753]: [Thread debugging using libthread_db enabled]
Sep 04 14:29:50 FORGE ollama[1545753]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Sep 04 14:29:50 FORGE ollama[1545753]: 0x000076b92e6ea42f in __GI___wait4 (pid=1545753, stat_loc=0x7ffecb4ca4d4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Sep 04 14:29:50 FORGE ollama[1461072]: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
Sep 04 14:29:50 FORGE ollama[1545753]: #0 0x000076b92e6ea42f in __GI___wait4 (pid=1545753, stat_loc=0x7ffecb4ca4d4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30 Sep 04 14:29:50 FORGE ollama[1545753]: 30 in ../sysdeps/unix/sysv/linux/wait4.c Sep 04 14:29:50 FORGE ollama[1545753]: #1 0x000076b92ee3ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so Sep 04 14:29:50 FORGE ollama[1545753]: #2 0x000076b92ef09d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so Sep 04 14:29:50 FORGE ollama[1545753]: #3 0x000076b92ef16c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so Sep 04 14:29:50 FORGE ollama[1545753]: #4 0x000076b992d6b469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so Sep 04 14:29:50 FORGE ollama[1545753]: #5 0x000076b992dacfe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so Sep 04 14:29:50 FORGE ollama[1545753]: #6 0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) () Sep 04 14:29:50 FORGE ollama[1545753]: #7 0x0000000000473710 in llama_server_context::load_model(gpt_params const&) () Sep 04 14:29:50 FORGE ollama[1545753]: #8 0x0000000000423058 in main () Sep 04 14:29:50 FORGE ollama[1461072]: time=2024-09-04T14:29:50.239-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" Sep 04 14:29:50 FORGE ollama[1545753]: [Inferior 1 (process 1545718) detached] Sep 04 14:29:50 FORGE ollama[1461072]: time=2024-09-04T14:29:50.690-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding" Sep 04 14:29:51 FORGE ollama[1461072]: time=2024-09-04T14:29:51.844-04:00 level=ERROR 
source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error" Sep 04 14:29:51 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:29:51 | 500 | 1.910032889s | 127.0.0.1 | POST "/api/chat" Sep 04 14:29:51 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:51.844-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:52 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:52.096-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:52 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:52.345-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:52 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:52.595-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:52 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:52.846-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:53 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:53.096-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:53 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:53.347-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:53 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:53.595-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:53 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:53.846-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:54 
FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:54.096-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:54 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:54.345-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:54 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:54.595-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:54 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:54.845-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:55 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:55.095-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:55 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:55.345-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:55 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:55.595-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:55 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:55.845-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:56 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:56.096-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:56 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:56.346-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:56 FORGE ollama[1461072]: cuda driver library failed 
to get device context 46time=2024-09-04T14:29:56.596-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:56 FORGE ollama[1461072]: time=2024-09-04T14:29:56.845-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001243643 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 Sep 04 14:29:56 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:56.845-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:57 FORGE ollama[1461072]: time=2024-09-04T14:29:57.095-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251050005 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 Sep 04 14:29:57 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:29:57.095-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:29:57 FORGE ollama[1461072]: time=2024-09-04T14:29:57.345-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501103261 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 Sep 04 14:30:04 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:30:04 | 200 | 44.846µs | 127.0.0.1 | HEAD "/" Sep 04 14:30:04 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:30:04 | 200 | 9.045069ms | 127.0.0.1 | POST "/api/show" Sep 04 14:30:04 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:04.574-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.599-04:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" 
gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="17.4 GiB" Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.600-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=18716137472 required="6.4 GiB" Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.600-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[17.4 GiB]" memory.required.full="6.4 GiB" memory.required.partial="6.4 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.4 GiB]" memory.weights.total="5.1 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="569.0 MiB" Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.601-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama4167472154/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 4 --port 36959" Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.601-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=2 Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.601-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding" Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.602-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error" Sep 04 14:30:04 FORGE ollama[1546104]: INFO [main] build info | build=1 commit="1e6f655" 
tid="138559970435072" timestamp=1725474604 Sep 04 14:30:04 FORGE ollama[1546104]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="138559970435072" timestamp=1725474604 total_threads=32 Sep 04 14:30:04 FORGE ollama[1546104]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="36959" tid="138559970435072" timestamp=1725474604 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: loaded meta data with 31 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 (version GGUF V3 (latest)) Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 0: general.architecture str = llama Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 1: general.type str = model Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 2: general.name str = Yi Coder 9B Chat Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 3: general.finetune str = Chat Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 4: general.basename str = Yi-Coder Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 5: general.size_label str = 9B Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 6: general.license str = apache-2.0 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 7: llama.block_count u32 = 48 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 8: llama.context_length u32 = 131072 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 9: llama.embedding_length u32 = 4096 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 10: llama.feed_forward_length u32 = 11008 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 11: llama.attention.head_count u32 = 32 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 4 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 13: llama.rope.freq_base f32 = 10000000.000000 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 15: general.file_type u32 = 2 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 16: llama.vocab_size u32 = 64000 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 128 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 18: tokenizer.ggml.add_space_prefix bool = false Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - 
kv 19: tokenizer.ggml.model str = llama Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 20: tokenizer.ggml.pre str = default Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,64000] = ["<unk>", "<|startoftext|>", "<|endof... Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,64000] = [-1000.000000, -1000.000000, -1000.00... Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,64000] = [3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 3, 3, ... Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 1 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 2 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 28: tokenizer.ggml.add_eos_token bool = false Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 29: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... 
Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - kv 30: general.quantization_version u32 = 2 Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - type f32: 97 tensors Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - type q4_0: 337 tensors Sep 04 14:30:04 FORGE ollama[1461072]: llama_model_loader: - type q6_K: 1 tensors Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_vocab: special tokens cache size = 12 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_vocab: token to piece cache size = 0.3834 MB Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: format = GGUF V3 (latest) Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: arch = llama Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: vocab type = SPM Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_vocab = 64000 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_merges = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: vocab_only = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_train = 131072 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd = 4096 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_layer = 48 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_head = 32 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_head_kv = 4 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_rot = 128 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_swa = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_k = 128 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd_head_v = 128 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_gqa = 8 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd_k_gqa = 512 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_embd_v_gqa = 512 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: f_norm_eps = 0.0e+00 Sep 04 14:30:04 FORGE 
ollama[1461072]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: f_logit_scale = 0.0e+00 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_ff = 11008 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_expert = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_expert_used = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: causal attn = 1 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: pooling type = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: rope type = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: rope scaling = linear Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: freq_base_train = 10000000.0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: freq_scale_train = 1 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: n_ctx_orig_yarn = 131072 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: rope_finetuned = unknown Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_conv = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_inner = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: ssm_d_state = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: ssm_dt_rank = 0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: model type = 34B Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: model ftype = Q4_0 Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: model params = 8.83 B Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: model size = 4.69 GiB (4.56 BPW) Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: general.name = Yi Coder 9B Chat Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: BOS token = 1 
'<|startoftext|>' Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: EOS token = 2 '<|endoftext|>' Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: UNK token = 0 '<unk>' Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: PAD token = 0 '<unk>' Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: LF token = 315 '<0x0A>' Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: EOT token = 2 '<|endoftext|>' Sep 04 14:30:04 FORGE ollama[1461072]: llm_load_print_meta: max token length = 48 Sep 04 14:30:04 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Sep 04 14:30:04 FORGE ollama[1461072]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Sep 04 14:30:04 FORGE ollama[1461072]: ggml_cuda_init: found 1 CUDA devices: Sep 04 14:30:04 FORGE ollama[1461072]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Sep 04 14:30:04 FORGE ollama[1461072]: CUDA error: CUDA-capable device(s) is/are busy or unavailable Sep 04 14:30:04 FORGE ollama[1461072]: current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040 Sep 04 14:30:04 FORGE ollama[1461072]: cudaMemGetInfo(free, total) Sep 04 14:30:04 FORGE ollama[1461072]: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546105] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546106] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546107] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546108] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546109] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546110] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546111] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546112] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546113] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546114] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546115] Sep 04 14:30:04 FORGE 
ollama[1546139]: [New LWP 1546116] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546117] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546118] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546119] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546120] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546121] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546122] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546123] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546124] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546125] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546126] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546127] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546128] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546129] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546130] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546131] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546132] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546133] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546134] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546135] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546136] Sep 04 14:30:04 FORGE ollama[1546139]: [New LWP 1546137] Sep 04 14:30:04 FORGE ollama[1546139]: [Thread debugging using libthread_db enabled] Sep 04 14:30:04 FORGE ollama[1546139]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Sep 04 14:30:04 FORGE ollama[1546139]: 0x00007e049daea42f in __GI___wait4 (pid=1546139, stat_loc=0x7fff12aef494, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30 Sep 04 14:30:04 FORGE ollama[1461072]: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory. 
Sep 04 14:30:04 FORGE ollama[1546139]: #0 0x00007e049daea42f in __GI___wait4 (pid=1546139, stat_loc=0x7fff12aef494, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30 Sep 04 14:30:04 FORGE ollama[1546139]: 30 in ../sysdeps/unix/sysv/linux/wait4.c Sep 04 14:30:04 FORGE ollama[1546139]: #1 0x00007e049e23ca88 in ggml_abort () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so Sep 04 14:30:04 FORGE ollama[1546139]: #2 0x00007e049e309d72 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so Sep 04 14:30:04 FORGE ollama[1546139]: #3 0x00007e049e316c52 in ggml_backend_cuda_get_device_memory () from /tmp/ollama4167472154/runners/cuda_v12/libggml.so Sep 04 14:30:04 FORGE ollama[1546139]: #4 0x00007e0502192469 in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so Sep 04 14:30:04 FORGE ollama[1546139]: #5 0x00007e05021d3fe2 in llama_load_model_from_file () from /tmp/ollama4167472154/runners/cuda_v12/libllama.so Sep 04 14:30:04 FORGE ollama[1546139]: #6 0x00000000004e452b in llama_init_from_gpt_params(gpt_params&) () Sep 04 14:30:04 FORGE ollama[1546139]: #7 0x0000000000473710 in llama_server_context::load_model(gpt_params const&) () Sep 04 14:30:04 FORGE ollama[1546139]: #8 0x0000000000423058 in main () Sep 04 14:30:04 FORGE ollama[1461072]: time=2024-09-04T14:30:04.863-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" Sep 04 14:30:04 FORGE ollama[1546139]: [Inferior 1 (process 1546104) detached] Sep 04 14:30:05 FORGE ollama[1461072]: time=2024-09-04T14:30:05.314-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding" Sep 04 14:30:06 FORGE ollama[1461072]: time=2024-09-04T14:30:06.468-04:00 level=ERROR 
source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable\n current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:3040\n cudaMemGetInfo(free, total)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error" Sep 04 14:30:06 FORGE ollama[1461072]: [GIN] 2024/09/04 - 14:30:06 | 500 | 1.910298228s | 127.0.0.1 | POST "/api/chat" Sep 04 14:30:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:06.469-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:06.720-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:06 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:06.970-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:07.220-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:07.470-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:07.720-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:07 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:07.971-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:08 FORGE ollama[1461072]: cuda driver library failed to get device context 
46time=2024-09-04T14:30:08.220-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:08.471-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:08.721-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:08 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:08.970-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:09.221-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:09.470-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:09.721-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:09 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:09.970-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:10.220-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:10.470-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:10.720-04:00 level=WARN 
source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:10 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:10.971-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:11.221-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:11 FORGE ollama[1461072]: time=2024-09-04T14:30:11.469-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000829377 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 Sep 04 14:30:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:11.470-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:11 FORGE ollama[1461072]: time=2024-09-04T14:30:11.719-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250832055 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 Sep 04 14:30:11 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:11.720-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" Sep 04 14:30:11 FORGE ollama[1461072]: time=2024-09-04T14:30:11.969-04:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500791551 model=/usr/share/ollama/.ollama/models/blobs/sha256-8169bd33ad1351c56755330bd6f1cf5696de6ac297420024fa8aebae5656b0e2 Sep 04 14:30:19 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T14:30:19.238-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" chris@FORGE:~/bin$ ollama run llama3.1 >>> Send a message (/? for help)

@dhiltgen commented on GitHub (Sep 5, 2024):

It looks like things start to go bad around here:

```
Sep 04 13:18:01 FORGE ollama[1461072]: cuda driver library failed to get device context 46time=2024-09-04T13:18:01.919-04:00 level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
```

We don't handle this failure case well: we fail to update the free VRAM value, so we think there's room to load a second model, and that load fails.

Looking in the cuda headers, the response code 46 maps to this:

```
/**
 * This indicates that all CUDA devices are busy or unavailable at the current
 * time. Devices are often busy/unavailable due to use of
 * ::cudaComputeModeExclusive, ::cudaComputeModeProhibited or when long
 * running CUDA kernels have filled up the GPU and are blocking new work
 * from starting. They can also be unavailable due to memory constraints
 * on a device that already has active CUDA work being performed.
 */
cudaErrorDevicesUnavailable           =     46,
```

Do you have other apps running that might be locking the GPU, or is just Ollama running on the GPU at the time?

I'm curious if there are any other interesting log messages in the kernel logs or dmesg around that time? (is there some other fault leading to this failure mode, or simply "normal" inference and some timing/race that leads to the unavailable error)

<!-- gh-comment-id:2332193782 -->

@iplayfast commented on GitHub (Sep 8, 2024):

It seems everything is using the GPU, including Thunderbird, Chrome, and Cinnamon. In my second post I showed the nvidia-smi output. It looks to me like there was plenty of room, and the two models in question should have fit.

```
chris@FORGE:~/github/FreeCAD/build$ ollama list | grep llama3.1
llama3.1:latest                             f66fc8dc39ea    4.7 GB    2 days ago
chris@FORGE:~/github/FreeCAD/build$ ollama list | grep yi-coder
yi-coder:latest                             0eed9e7baf59    5.0 GB    2 days ago
chris@FORGE:~/github/FreeCAD/build$
```
<!-- gh-comment-id:2336507519 -->

@dhiltgen commented on GitHub (Sep 9, 2024):

It sounds like we just need to harden for this error case and retry until the device is no longer busy.

<!-- gh-comment-id:2338674435 -->

@iplayfast commented on GitHub (Sep 24, 2024):

0.3.11 still has the same error. Tried with reader-lm and llama3.1; the two couldn't be loaded at the same time.

<!-- gh-comment-id:2371952228 -->

@iplayfast commented on GitHub (Oct 3, 2024):

```
chris@FORGE:~$ ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.2:1b    baf6a787fdff    2.7 GB    100% GPU     4 minutes from now
chris@FORGE:~$ ollama run llama3.2:latest
Error: llama runner process has terminated: CUDA error
chris@FORGE:~$ ollama --version
ollama version is 0.3.12
chris@FORGE:~$
```
```
nvidia-smi
Thu Oct  3 02:41:59 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   43C    P8              15W / 450W |   4688MiB / 24564MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3057      G   /usr/lib/xorg/Xorg                         1215MiB |
|    0   N/A  N/A      3583      G   cinnamon                                    119MiB |
|    0   N/A  N/A      8267      G   ...seed-version=20240920-130106.786000      531MiB |
|    0   N/A  N/A    567562      G   ...yOnDemand --variations-seed-version      105MiB |
|    0   N/A  N/A    994587      G   ...yOnDemand --variations-seed-version       74MiB |
|    0   N/A  N/A   2174681      G   ...ures=SpareRendererForSitePerProcess       56MiB |
|    0   N/A  N/A   2557574      G   qtcreator                                     7MiB |
|    0   N/A  N/A   2620277      G   qtcreator                                     7MiB |
|    0   N/A  N/A   3257412      C   ...unners/cuda_v12/ollama_llama_server     2536MiB |
+---------------------------------------------------------------------------------------+
```
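(Editor's note: the "Compute M." column above reads "E. Process", i.e. Exclusive Process mode, which matches one of the causes named in the CUDA header comment quoted earlier (`cudaComputeModeExclusive`): only one CUDA context may hold the device at a time, so a second model load can fail with error 46 even with VRAM to spare. A way to check and reset this, assuming a working NVIDIA driver with `nvidia-smi` available:)

```shell
# Show the current compute mode; "Exclusive_Process" permits only one
# CUDA context per device at a time.
nvidia-smi --query-gpu=compute_mode --format=csv

# Reset to the default shared mode (requires root; the setting typically
# reverts on reboot unless reapplied by a startup script).
sudo nvidia-smi -c DEFAULT
```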
<!-- gh-comment-id:2390642097 -->
Reference: github-starred/ollama#4178