[GH-ISSUE #1618] WSL: Error: timed out waiting for llama runner to start #899

Closed
opened 2026-04-12 10:34:16 -05:00 by GiteaMirror · 10 comments

Originally created by @otavio-silva on GitHub (Dec 19, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1618

Originally assigned to: @dhiltgen on GitHub.

Description

When trying to run the dolphin-mixtral model (https://ollama.ai/library/dolphin-mixtral) in a container, I get an "Error: timed out waiting for llama runner to start" response.

Steps to reproduce

> podman run --device nvidia.com/gpu=all --security-opt label=disable --detach --volume .ollama:/root/.ollama --net host --name ollama ollama/ollama
> podman exec -it ollama ollama run dolphin-mixtral
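If the runner does time out, the server log can be captured from the container's stdout for reporting (a sketch; the container name matches the run command above):

> podman logs ollama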

Logs

ollama.log (https://github.com/jmorganca/ollama/files/13720833/ollama.log)

Device info

Host Name:                                 GE76RAIDER
OS Name:                                   Microsoft Windows 11 Pro
OS Version:                                10.0.22631 N/A Build 22631
OS Manufacturer:                           Microsoft Corporation
OS Configuration:                          Standalone Workstation
OS Build Type:                             Multiprocessor Free
Registered Owner:                          otavioasilva@hotmail.com
Registered Organization:                   N/A
Product ID:                                00330-80000-00000-AA520
Original Install Date:                     02/08/2023, 14:30:14
System Boot Time:                          16/12/2023, 22:35:35
System Manufacturer:                       Micro-Star International Co., Ltd.
System Model:                              Raider GE76 12UHS
System Type:                               x64-based PC
Processor(s):                              1 Processor(s) Installed.
                                           [01]: Intel64 Family 6 Model 154 Stepping 3 GenuineIntel ~2900 Mhz
BIOS Version:                              American Megatrends International, LLC. E17K4IMS.20D, 26/06/2023
Windows Directory:                         C:\WINDOWS
System Directory:                          C:\WINDOWS\system32
Boot Device:                               \Device\HarddiskVolume1
System Locale:                             pt-br;Portuguese (Brazil)
Input Locale:                              en-us;English (United States)
Time Zone:                                 (UTC-03:00) Brasília
Total Physical Memory:                     65,305 MB
Available Physical Memory:                 46,483 MB
Virtual Memory: Max Size:                  75,033 MB
Virtual Memory: Available:                 49,770 MB
Virtual Memory: In Use:                    25,263 MB
Page File Location(s):                     C:\pagefile.sys
Domain:                                    WORKGROUP
Logon Server:                              \\GE76RAIDER
Hotfix(s):                                 4 Hotfix(s) Installed.
                                           [01]: KB5032007
                                           [02]: KB5027397
                                           [03]: KB5033375
                                           [04]: KB5032393
Network Card(s):                           3 NIC(s) Installed.
                                           [01]: Killer E3100G 2.5 Gigabit Ethernet Controller
                                                 Connection Name: Ethernet
                                                 Status:          Media disconnected
                                           [02]: Killer(R) Wi-Fi 6E AX1675i 160MHz Wireless Network Adapter (211NGW)
                                                 Connection Name: Wi-Fi
                                                 DHCP Enabled:    Yes
                                                 DHCP Server:     192.168.1.1
                                                 IP address(es)
                                                 [01]: 192.168.1.27
                                           [03]: TAP-Windows Adapter V9
                                                 Connection Name: TAP-Windows
                                                 Status:          Media disconnected
Hyper-V Requirements:                      A hypervisor has been detected. Features required for Hyper-V will not be displayed.


@vishnupkstrata commented on GitHub (Dec 20, 2023):

I've tried installing today and I'm facing the same issue.


@jexom commented on GitHub (Dec 21, 2023):

Same issue on 64 GB of RAM with an RTX 3060. I can see it allocating the RAM in Task Manager, but it takes way too long to load and then just times out. This is running a Docker container on Windows.
Update: I decided to try running Ollama on WSL. There it runs fine, loading up pretty fast. The issue seems to be in Docker.


@otavio-silva commented on GitHub (Dec 21, 2023):

Can confirm that it runs as expected on WSL2; this seems to be a problem with the container image.
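For anyone who wants to reproduce the WSL2 comparison, a native install that bypasses the container entirely can be sketched as follows (assuming the official install script URL, which may have changed since this thread):

> curl -fsSL https://ollama.com/install.sh | sh
> ollama run dolphin-mixtral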


@BruceMacD commented on GitHub (Dec 22, 2023):

Hi @otavio-silva, this can happen when the container doesn't have enough resources to load the model before timing out. We have a change merging soon that will prevent this kind of timeout. In the meantime, you can try increasing the resources available to your container if possible.
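On Windows and macOS, Podman containers run inside a backend virtual machine, so increasing the resources available to the container in practice means resizing that machine. A minimal sketch, assuming the default machine; the memory (MiB) and CPU values are only illustrative:

> podman machine stop
> podman machine set --memory 16384 --cpus 8
> podman machine start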


@otavio-silva commented on GitHub (Dec 22, 2023):

Hi @BruceMacD, I made a new podman machine with enough memory, but the container shows the behavior described by @jexom: it allocates memory very slowly and times out before it can complete the process. The same thing happens with other models that can load completely, like llama2; memory allocation is extremely slow.


@dhiltgen commented on GitHub (Jan 27, 2024):

@otavio-silva we've revamped the way we load the LLM library since you filed this issue. Can you try out the latest release 0.1.22 and see if that resolves the problem? If not, please share a server log from the new version.


@otavio-silva commented on GitHub (Jan 27, 2024):

@dhiltgen do you mean the original issue or the slow memory allocation one?
The first one was solved; the memory one persists. Here are the logs:

2024/01/27 01:16:39 images.go:857: INFO total blobs: 31
2024/01/27 01:16:39 images.go:864: INFO total unused blobs removed: 0
2024/01/27 01:16:39 routes.go:950: INFO Listening on [::]:11434 (version 0.1.22)
2024/01/27 01:16:39 payload_common.go:106: INFO Extracting dynamic libraries...
2024/01/27 01:16:41 payload_common.go:145: INFO Dynamic LLM libraries [cpu_avx2 rocm_v6 cpu cuda_v11 rocm_v5 cpu_avx]
2024/01/27 01:16:41 gpu.go:94: INFO Detecting GPU type
2024/01/27 01:16:41 gpu.go:236: INFO Searching for GPU management library libnvidia-ml.so
2024/01/27 01:16:41 gpu.go:282: INFO Discovered GPU libraries: [/usr/lib/wsl/drivers/nvmii.inf_amd64_2a8cae9d0cba5813/libnvidia-ml.so.1]
2024/01/27 01:16:43 gpu.go:99: INFO Nvidia GPU detected
2024/01/27 01:16:43 gpu.go:140: INFO CUDA Compute Capability detected: 8.6
[GIN] 2024/01/27 - 01:17:01 | 200 |        16.3µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/01/27 - 01:17:01 | 200 |   39.852596ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/01/27 - 01:17:01 | 200 |   15.844858ms |       127.0.0.1 | POST     "/api/show"
2024/01/27 01:17:19 gpu.go:140: INFO CUDA Compute Capability detected: 8.6
2024/01/27 01:17:19 gpu.go:140: INFO CUDA Compute Capability detected: 8.6
2024/01/27 01:17:19 cpu_common.go:11: INFO CPU has AVX2
2024/01/27 01:17:19 dyn_ext_server.go:90: INFO Loading Dynamic llm server: /tmp/ollama1751675031/cuda_v11/libext_server.so
2024/01/27 01:17:19 dyn_ext_server.go:145: INFO Initializing llama server
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256:6aa74acf170f8fb8e6ff8dae9bc9ea918d3a14b6ba95d0b0287da31b09a4848c (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = georgesung
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = georgesung
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:      CUDA0 buffer size =  3577.56 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    12.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   156.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
2024/01/27 01:19:28 dyn_ext_server.go:156: INFO Starting llama main loop
[GIN] 2024/01/27 - 01:19:28 | 200 |         2m26s |       127.0.0.1 | POST     "/api/chat"
2024/01/27 01:19:53 dyn_ext_server.go:170: INFO loaded 0 images
[GIN] 2024/01/27 - 01:20:00 | 200 |  6.722048863s |       127.0.0.1 | POST     "/api/chat"


@dhiltgen commented on GitHub (Jan 27, 2024):

To clarify, it sounds like you now have a functional system from the fixes that have gone in since you initially filed the issue, but the initial memory allocation is slower than you expected. Once loaded, does the TPS rate look reasonable? Are you experiencing any timeouts/errors in the client as a result?
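One way to answer the TPS question is the CLI's verbose flag, which prints load and eval timings after each response. A sketch, reusing the container name from the original report:

> podman exec -it ollama ollama run dolphin-mixtral --verbose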


@otavio-silva commented on GitHub (Jan 27, 2024):

@dhiltgen No such timeouts or errors on the latest version. Performance is great, except when a chat is inactive for more than 5 minutes; then Ollama allocates the memory all over again and it takes a while. But no crashes.


@dhiltgen commented on GitHub (Jan 27, 2024):

That's great to hear!

We're working on some improvements to make the inactivity timeout configurable, which should make its way into a release pretty soon. I think we can consider this issue resolved now.
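The configurable timeout mentioned here later surfaced as a keep_alive request option. A hedged sketch of pinning a model in memory via the API (negative values keep the model loaded indefinitely; exact behavior may differ across versions):

> curl http://localhost:11434/api/generate -d '{"model": "dolphin-mixtral", "keep_alive": -1}'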
