[GH-ISSUE #11261] the model load success and then it's gone #33182

Closed
opened 2026-04-22 15:36:57 -05:00 by GiteaMirror · 2 comments

Originally created by @fishfl on GitHub (Jul 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11261

What is the issue?

Does Ollama have a lazy mode?
We load the llava model using the API, and it succeeds.
Running the `ollama ps` command, we can see the model process.
But after a few minutes, the model process is gone.

![Image](https://github.com/user-attachments/assets/8356b752-5ca6-41d9-9942-abcffbb58956)
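
A minimal reproduction of the report, assuming a default local server on port 11434 and the model name `llava`:

```shell
# Load the model through the API; the request succeeds.
curl http://localhost:11434/api/generate -d '{"model": "llava", "prompt": "hi"}'

# The model appears in the list of loaded models.
ollama ps

# After the keep-alive window elapses (a few minutes), it is gone.
sleep 360 && ollama ps
```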

Relevant log output

time=2025-07-02T02:37:03.946Z level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-07-02T02:37:03.947Z level=INFO source=images.go:480 msg="total blobs: 0"
time=2025-07-02T02:37:03.947Z level=INFO source=images.go:487 msg="total unused blobs removed: 0"
time=2025-07-02T02:37:03.947Z level=INFO source=routes.go:1288 msg="Listening on [::]:11434 (version 0.9.2)"
time=2025-07-02T02:37:03.947Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-07-02T02:37:03.956Z level=INFO source=gpu.go:377 msg="no compatible GPUs were discovered"
time=2025-07-02T02:37:03.956Z level=INFO source=types.go:130 msg="inference compute" id=0 library=cpu variant="" compute="" driver=0.0 name="" total="313.9 GiB" available="149.7 GiB"
time=2025-07-02T02:40:18.993Z level=INFO source=server.go:135 msg="system memory" total="313.9 GiB" free="147.6 GiB" free_swap="0 B"
time=2025-07-02T02:40:18.995Z level=INFO source=server.go:168 msg=offload library=cpu layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[147.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.0 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[6.0 GiB]" memory.weights.total="3.8 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB" projector.weights="595.5 MiB" projector.graph="0 B"
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = liuhaotian
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 3.83 GiB (4.54 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1637 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 7.24 B
print_info: general.name     = liuhaotian
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
llama_model_load: vocab only - skipping tensors
time=2025-07-02T02:40:19.097Z level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 --ctx-size 8192 --batch-size 512 --threads 36 --no-mmap --parallel 2 --mmproj /root/.ollama/models/blobs/sha256-72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539 --port 40903"
time=2025-07-02T02:40:19.097Z level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-02T02:40:19.097Z level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-07-02T02:40:19.098Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-02T02:40:19.116Z level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-skylakex.so
time=2025-07-02T02:40:19.126Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-07-02T02:40:19.128Z level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:40903"
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = liuhaotian
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 3.83 GiB (4.54 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1637 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.24 B
print_info: general.name     = liuhaotian
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  3917.87 MiB
time=2025-07-02T02:40:19.349Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
[GIN] 2025/07/02 - 02:40:19 | 200 |      41.398µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/07/02 - 02:40:19 | 200 |     193.973µs |       127.0.0.1 | GET      "/api/ps"
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.28 MiB
llama_kv_cache_unified: kv_size = 8192, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
llama_kv_cache_unified:        CPU KV buffer size =  1024.00 MiB
llama_kv_cache_unified: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_context:        CPU compute buffer size =   560.01 MiB
llama_context: graph nodes  = 1094
llama_context: graph splits = 1
clip_ctx: CLIP using CPU backend
clip_model_loader: model name:   openai/clip-vit-large-patch14-336
clip_model_loader: description:  image encoder for LLaVA
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    377
clip_model_loader: n_kv:         19

clip_model_loader: tensor[0]: n_dims = 1, name = mm.0.bias, tensor_size=16384, offset=0, shape:[4096, 1, 1, 1], type = f32
clip_model_loader: tensor[1]: n_dims = 2, name = mm.0.weight, tensor_size=8388608, offset=16384, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[2]: n_dims = 1, name = mm.2.bias, tensor_size=16384, offset=8404992, shape:[4096, 1, 1, 1], type = f32

...

load_hparams: projector:          mlp
load_hparams: n_embd:             1024
load_hparams: n_head:             16
load_hparams: n_ff:               4096
load_hparams: n_layer:            23
load_hparams: projection_dim:     768
load_hparams: image_size:         336
load_hparams: patch_size:         14

load_hparams: has_llava_proj:     1
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       0
load_hparams: ffn_op:             gelu_quick
load_hparams: model size:         595.49 MiB
load_hparams: metadata size:      0.13 MiB
load_tensors: loaded 377 tensors from /root/.ollama/models/blobs/sha256-72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539
alloc_compute_meta:        CPU compute buffer size =    32.88 MiB
time=2025-07-02T02:40:23.611Z level=INFO source=server.go:630 msg="llama runner started in 4.51 seconds"
[GIN] 2025/07/02 - 02:40:23 | 200 |   4.64251133s |   10.147.13.186 | POST     "/api/chat"
[GIN] 2025/07/02 - 02:40:35 | 200 |      42.935µs |       127.0.0.1 | HEAD     "/"

OS

Linux

GPU

No response

CPU

Intel

Ollama version

0.9.2

GiteaMirror added the bug label 2026-04-22 15:36:57 -05:00
@rick-github commented on GitHub (Jul 2, 2025):

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately
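
The linked FAQ covers a server-wide default via the `OLLAMA_KEEP_ALIVE` environment variable (visible in the log above as `OLLAMA_KEEP_ALIVE:5m0s`). A sketch for a systemd-managed Linux install, which is an assumption about this setup:

```shell
# Open a systemd override for the Ollama service.
sudo systemctl edit ollama.service

# In the editor, add under [Service]:
#   Environment="OLLAMA_KEEP_ALIVE=-1"
# -1 keeps models loaded indefinitely; a duration like "30m"
# sets a longer idle timeout; 0 unloads immediately.

# Apply the change.
sudo systemctl restart ollama
```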

@gee-coder commented on GitHub (Jul 4, 2025):

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately

@fishfl The log you provided shows that the keep-alive time for the model is set to 5 minutes (`OLLAMA_KEEP_ALIVE:5m0s`), so the model is unloaded from memory after 5 minutes of inactivity.

![Image](https://github.com/user-attachments/assets/41e4c042-a66d-478f-a891-aba0302cdd72)

This is not a bug. You may want to read the documentation linked by @rick-github to learn how to keep a model loaded in memory indefinitely.
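
For a per-request override, the same FAQ documents a `keep_alive` parameter on the `/api/generate` and `/api/chat` endpoints; a minimal sketch, reusing the host and model name assumed above:

```shell
# keep_alive: -1 keeps llava resident indefinitely after this request;
# a duration string such as "10m" extends the idle timeout instead,
# and 0 unloads the model as soon as the response is finished.
curl http://localhost:11434/api/chat -d '{
  "model": "llava",
  "messages": [{"role": "user", "content": "hello"}],
  "keep_alive": -1
}'
```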

Reference: github-starred/ollama#33182