[GH-ISSUE #6902] No ollama model can recognize the referenced information. #30126

Closed
opened 2026-04-22 09:35:54 -05:00 by GiteaMirror · 9 comments
Owner

Originally created by @goactiongo on GitHub (Sep 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6902

What is the issue?

Scene One
Calling a public cloud-based LLM through an AI agent, I upload two documents of more than 2,000 words each and ask: "Analyze the differences between the two documents." Used this way, the model analyzes the differences between the two documents normally.

Scene Two
When the locally deployed Ollama 0.3.3 model is called instead (multiple different models have been tried), with the same documents and the same question, the model responds that it cannot find the documents to compare.
If the document content is reduced to around 1,000 words, the model can then compare them normally.
Adjusting the model's maxContext and maxResponse from small to large has no effect.

Scene Three
When a document exceeding 2,000 words is uploaded and the local Ollama model is asked to summarize it, the same issue arises. However, if the document is reduced to around 1,000 words, the local model analyzes it normally.

The problem persists despite trying multiple Ollama models and adjusting maxContext and maxResponse from 2,000 to 30,000.
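One thing worth double-checking (an assumption on my part, not confirmed for this setup): Ollama's effective context window is controlled by its num_ctx option, and the log below shows the runner loading with n_ctx = 8192 regardless of the agent-side settings. Clients that go through the OpenAI-compatible /v1/chat/completions endpoint (as this log shows) generally cannot pass num_ctx per request, so an agent's maxContext value may never reach the runner and longer inputs get truncated before the model sees them. A possible workaround is to bake a larger context into a derived model via a Modelfile — a sketch, assuming the gemma2:27b tag matches the gemma-2-27b-it model in the log:

```
# Hypothetical Modelfile: raise the context window for a derived model
FROM gemma2:27b
PARAMETER num_ctx 8192
```

Then build and use the derived model, e.g. `ollama create gemma2-8k -f Modelfile`, and point the agent at `gemma2-8k`. Note that gemma-2 was trained with an 8192-token context (n_ctx_train = 8192 in the log), so raising num_ctx beyond that is unlikely to help.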

The log messages are as follows:

(base) [root@gpu ~]# journalctl -u ollama -r
-- Logs begin at 一 2024-09-02 03:24:01 CST, end at 六 2024-09-21 21:59:04 CST. --
9月 21 21:59:04 gpu ollama[48923]: [GIN] 2024/09/21 - 21:59:04 | 200 | 11.452294657s | 172.16.1.219 | POST "/v1/chat/completions"
9月 21 21:59:01 gpu ollama[48923]: time=2024-09-21T21:59:01.646+08:00 level=INFO source=server.go:623 msg="llama runner started in 6.78 seconds"
9月 21 21:59:01 gpu ollama[48923]: INFO [main] model loaded | tid="140514816995328" timestamp=1726927141
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: graph splits = 2
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: graph nodes = 1850
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: freq_scale = 1
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: freq_base = 10000.0
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: flash_attn = 0
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: n_ubatch = 512
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: n_batch = 512
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: n_ctx = 8192
9月 21 21:58:58 gpu ollama[48923]: time=2024-09-21T21:58:58.185+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: CPU buffer size = 922.85 MiB
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: offloaded 47/47 layers to GPU
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: offloading non-repeating layers to GPU
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: offloading 46 repeating layers to GPU
9月 21 21:58:56 gpu ollama[48923]: time=2024-09-21T21:58:56.580+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve
9月 21 21:58:56 gpu ollama[48923]: llm_load_tensors: ggml ctx size = 0.45 MiB
9月 21 21:58:55 gpu ollama[48923]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
9月 21 21:58:55 gpu ollama[48923]: ggml_cuda_init: found 1 CUDA devices:
9月 21 21:58:55 gpu ollama[48923]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
9月 21 21:58:55 gpu ollama[48923]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: max token length = 93
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: EOT token = 107 '<end_of_turn>'
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: LF token = 227 '<0x0A>'
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: PAD token = 0 '<pad>'
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: UNK token = 3 '<unk>'
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: EOS token = 1 '<eos>'
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: BOS token = 2 '<bos>'
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: general.name = gemma-2-27b-it
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW)
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: model params = 27.23 B
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: model ftype = Q4_0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: model type = 27B
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: ssm_dt_rank = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: ssm_d_state = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: ssm_d_inner = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: ssm_d_conv = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: rope_finetuned = unknown
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_ctx_orig_yarn = 8192
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: freq_scale_train = 1
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: freq_base_train = 10000.0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: rope scaling = linear
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: rope type = 2
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: pooling type = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: causal attn = 1
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_expert_used = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_expert = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_ff = 36864
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_logit_scale = 0.0e+00
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_norm_eps = 0.0e+00
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd_v_gqa = 2048
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd_k_gqa = 2048
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_gqa = 2
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd_head_v = 128
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd_head_k = 128
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_swa = 4096
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_rot = 128
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_head_kv = 16
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_head = 32
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_layer = 46
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd = 4608
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_ctx_train = 8192
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: vocab_only = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_merges = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_vocab = 256000
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: vocab type = SPM
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: arch = gemma2
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: format = GGUF V3 (latest)
9月 21 21:58:55 gpu ollama[48923]: llm_load_vocab: token to piece cache size = 1.6014 MB
9月 21 21:58:55 gpu ollama[48923]: llm_load_vocab: special tokens cache size = 108
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - type q6_K: 1 tensors
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - type q4_0: 322 tensors
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - type f32: 185 tensors
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000
9月 21 21:58:55 gpu ollama[48923]: time=2024-09-21T21:58:55.122+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "",
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 11: general.file_type u32 = 2
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 0: general.architecture str = gemma2
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d
9月 21 21:58:54 gpu ollama[48923]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="34781" tid="140514816995328" timestamp=1726927
9月 21 21:58:54 gpu ollama[48923]: INFO [main] system info | n_threads=32 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VB
9月 21 21:58:54 gpu ollama[48923]: INFO [main] build info | build=1 commit="6eeaeba" tid="140514816995328" timestamp=1726927134
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.869+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.864+08:00 level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.863+08:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.863+08:00 level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama242898797/runners/
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.862+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=47 laye
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.861+08:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading

After upgrading to 0.3.11, the same issue occurs, with the log as follows:

9月 21 22:39:08 gpu ollama[54696]: [GIN] 2024/09/21 - 22:39:08 | 200 | 10.260377093s | 172.16.1.219 | POST "/v1/chat/completions"
9月 21 22:39:05 gpu ollama[54696]: time=2024-09-21T22:39:05.203+08:00 level=INFO source=server.go:626 msg="llama runner started in 5.67 seconds"
9月 21 22:39:05 gpu ollama[54696]: INFO [main] model loaded | tid="140149569400832" timestamp=1726929545
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: graph splits = 2
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: graph nodes = 1850
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: freq_scale = 1
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: freq_base = 10000.0
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: flash_attn = 0
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: n_ubatch = 512
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: n_batch = 512
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: n_ctx = 8192
9月 21 22:39:02 gpu ollama[54696]: time=2024-09-21T22:39:02.405+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: CPU buffer size = 922.85 MiB
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: offloaded 47/47 layers to GPU
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: offloading non-repeating layers to GPU
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: offloading 46 repeating layers to GPU
9月 21 22:39:01 gpu ollama[54696]: time=2024-09-21T22:39:01.250+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve
9月 21 22:39:00 gpu ollama[54696]: llm_load_tensors: ggml ctx size = 0.45 MiB
9月 21 22:39:00 gpu ollama[54696]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
9月 21 22:39:00 gpu ollama[54696]: ggml_cuda_init: found 1 CUDA devices:
9月 21 22:39:00 gpu ollama[54696]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
9月 21 22:39:00 gpu ollama[54696]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: max token length = 93
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: EOT token = 107 '<end_of_turn>'
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: LF token = 227 '<0x0A>'
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: PAD token = 0 '<pad>'
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: UNK token = 3 '<unk>'
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: EOS token = 1 '<eos>'
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: BOS token = 2 '<bos>'
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: general.name = gemma-2-27b-it
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW)
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model params = 27.23 B
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model ftype = Q4_0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model type = 27B
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_dt_b_c_rms = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_dt_rank = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_d_state = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_d_inner = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_d_conv = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: rope_finetuned = unknown
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_ctx_orig_yarn = 8192
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: freq_scale_train = 1
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: freq_base_train = 10000.0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: rope scaling = linear
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: rope type = 2
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: pooling type = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: causal attn = 1
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_expert_used = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_expert = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_ff = 36864
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_logit_scale = 0.0e+00
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_norm_eps = 0.0e+00
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_v_gqa = 2048
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_k_gqa = 2048
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_gqa = 2
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_head_v = 128
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_head_k = 128
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_swa = 4096
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_rot = 128
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_head_kv = 16
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_head = 32
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_layer = 46
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd = 4608
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_ctx_train = 8192
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: vocab_only = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_merges = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_vocab = 256000
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: vocab type = SPM
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: arch = gemma2
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: format = GGUF V3 (latest)
9月 21 22:39:00 gpu ollama[54696]: llm_load_vocab: token to piece cache size = 1.6014 MB
9月 21 22:39:00 gpu ollama[54696]: llm_load_vocab: special tokens cache size = 108
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - type q6_K: 1 tensors
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - type q4_0: 322 tensors
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - type f32: 185 tensors
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "",
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.792+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 11: general.file_type u32 = 2
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 0: general.architecture str = gemma2
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d
9月 21 22:38:59 gpu ollama[54696]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="43383" tid="140149569400832" timestamp=1726929
9月 21 22:38:59 gpu ollama[54696]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VB
9月 21 22:38:59 gpu ollama[54696]: INFO [main] build info | build=10 commit="9225b05" tid="140149569400832" timestamp=1726929539
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.537+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.536+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.536+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.534+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2642747161/runners
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.516+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 laye
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.514+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="111.5 GiB" free_sw
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.514+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loadin
9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.422+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e
9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.422+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2
9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.422+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd00879
9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.421+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a0
9月 21 22:38:50 gpu ollama[54696]: time=2024-09-21T22:38:50.097+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
9月 21 22:38:50 gpu ollama[54696]: time=2024-09-21T22:38:50.097+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda
9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.553+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2642747161/runn
9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.552+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.551+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.549+08:00 level=INFO source=images.go:753 msg="total blobs: 44"
9月 21 22:38:34 gpu ollama[54696]: 2024/09/21 22:38:34 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HS

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.3

ollama[48923]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000 9月 21 21:58:55 gpu ollama[48923]: time=2024-09-21T21:58:55.122+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve 9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 11: general.file_type u32 = 2 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 1: 
general.name str = gemma-2-27b-it 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 0: general.architecture str = gemma2 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d 9月 21 21:58:54 gpu ollama[48923]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="34781" tid="140514816995328" timestamp=1726927 9月 21 21:58:54 gpu ollama[48923]: INFO [main] system info | n_threads=32 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VB 9月 21 21:58:54 gpu ollama[48923]: INFO [main] build info | build=1 commit="6eeaeba" tid="140514816995328" timestamp=1726927134 9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.869+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve 9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.864+08:00 level=INFO source=server.go:584 msg="waiting for llama runner to start responding" 9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.863+08:00 level=INFO source=sched.go:445 msg="loaded runners" count=1 9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.863+08:00 level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama242898797/runners/ 9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.862+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=47 laye 9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.861+08:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading ### upgrate to 0.3.11 , the same issue with log as followed 9月 21 22:39:08 gpu ollama[54696]: [GIN] 2024/09/21 - 22:39:08 | 200 | 10.260377093s | 172.16.1.219 | POST 
"/v1/chat/completions" 9月 21 22:39:05 gpu ollama[54696]: time=2024-09-21T22:39:05.203+08:00 level=INFO source=server.go:626 msg="llama runner started in 5.67 seconds" 9月 21 22:39:05 gpu ollama[54696]: INFO [main] model loaded | tid="140149569400832" timestamp=1726929545 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: graph splits = 2 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: graph nodes = 1850 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB 9月 21 22:39:04 gpu ollama[54696]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: freq_scale = 1 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: freq_base = 10000.0 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: flash_attn = 0 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: n_ubatch = 512 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: n_batch = 512 9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: n_ctx = 8192 9月 21 22:39:02 gpu ollama[54696]: time=2024-09-21T22:39:02.405+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve 9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB 9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: CPU buffer size = 922.85 MiB 9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: offloaded 47/47 layers to GPU 9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: offloading non-repeating layers to GPU 9月 21 22:39:02 gpu ollama[54696]: 
llm_load_tensors: offloading 46 repeating layers to GPU 9月 21 22:39:01 gpu ollama[54696]: time=2024-09-21T22:39:01.250+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve 9月 21 22:39:00 gpu ollama[54696]: llm_load_tensors: ggml ctx size = 0.45 MiB 9月 21 22:39:00 gpu ollama[54696]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes 9月 21 22:39:00 gpu ollama[54696]: ggml_cuda_init: found 1 CUDA devices: 9月 21 22:39:00 gpu ollama[54696]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 9月 21 22:39:00 gpu ollama[54696]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: max token length = 93 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: EOT token = 107 '<end_of_turn>' 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: LF token = 227 '<0x0A>' 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: PAD token = 0 '<pad>' 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: UNK token = 3 '<unk>' 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: EOS token = 1 '<eos>' 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: BOS token = 2 '<bos>' 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: general.name = gemma-2-27b-it 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW) 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model params = 27.23 B 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model ftype = Q4_0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model type = 27B 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_dt_b_c_rms = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_dt_rank = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_d_state = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_d_inner = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_d_conv = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: rope_finetuned = unknown 9月 
21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_ctx_orig_yarn = 8192 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: freq_scale_train = 1 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: freq_base_train = 10000.0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: rope scaling = linear 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: rope type = 2 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: pooling type = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: causal attn = 1 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_expert_used = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_expert = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_ff = 36864 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_logit_scale = 0.0e+00 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_norm_eps = 0.0e+00 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_v_gqa = 2048 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_k_gqa = 2048 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_gqa = 2 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_head_v = 128 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_head_k = 128 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_swa = 4096 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_rot = 128 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_head_kv = 16 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_head = 32 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_layer = 46 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd = 4608 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_ctx_train = 8192 9月 21 22:39:00 gpu 
ollama[54696]: llm_load_print_meta: vocab_only = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_merges = 0 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_vocab = 256000 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: vocab type = SPM 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: arch = gemma2 9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: format = GGUF V3 (latest) 9月 21 22:39:00 gpu ollama[54696]: llm_load_vocab: token to piece cache size = 1.6014 MB 9月 21 22:39:00 gpu ollama[54696]: llm_load_vocab: special tokens cache size = 108 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - type q6_K: 1 tensors 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - type q4_0: 322 tensors 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - type f32: 185 tensors 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 18: tokenizer.ggml.scores 
arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", 9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.792+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 11: general.file_type u32 = 2 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it 9月 21 22:38:59 gpu ollama[54696]: 
llama_model_loader: - kv 0: general.architecture str = gemma2 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d 9月 21 22:38:59 gpu ollama[54696]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="43383" tid="140149569400832" timestamp=1726929 9月 21 22:38:59 gpu ollama[54696]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VB 9月 21 22:38:59 gpu ollama[54696]: INFO [main] build info | build=10 commit="9225b05" tid="140149569400832" timestamp=1726929539 9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.537+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve 9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.536+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding" 9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.536+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.534+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2642747161/runners 9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.516+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 laye 9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.514+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="111.5 GiB" free_sw 9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.514+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loadin 9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.422+08:00 level=INFO 
source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e 9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.422+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2 9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.422+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd00879 9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.421+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a0 9月 21 22:38:50 gpu ollama[54696]: time=2024-09-21T22:38:50.097+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs" 9月 21 22:38:50 gpu ollama[54696]: time=2024-09-21T22:38:50.097+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda 9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.553+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2642747161/runn 9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.552+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)" 9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.551+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" 9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.549+08:00 level=INFO source=images.go:753 msg="total blobs: 44" 9月 21 22:38:34 gpu ollama[54696]: 2024/09/21 22:38:34 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HS ### OS Linux ### GPU Nvidia ### CPU Intel ### Ollama version 0.3.3
GiteaMirror added the bug label 2026-04-22 09:35:54 -05:00

@rick-github commented on GitHub (Sep 21, 2024):

Your log is missing useful information. Run this: journalctl -u ollama --no-pager. How are you adjusting maxContext and maxResponse?

<!-- gh-comment-id:2365366157 -->
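For reference: on the ollama side the per-request context window is set via the `num_ctx` option in the request (or persistently with `PARAMETER num_ctx` in a Modelfile); an agent-side `maxContext` field does not automatically become `num_ctx`. A minimal sketch of a request payload that sizes the context explicitly (model tag and endpoint taken from this thread; the prompt text is a placeholder):

```python
import json

# Payload for ollama's /api/chat endpoint. "num_ctx" is the option that
# actually sizes ollama's context window; it defaults to 2048 if omitted.
payload = {
    "model": "gemma2:27b",
    "messages": [{"role": "user", "content": "Summarize the attached document."}],
    "stream": False,
    # Explicitly request an 8192-token context window.
    "options": {"num_ctx": 8192},
}

body = json.dumps(payload)
print(body)  # POST this to http://localhost:11434/api/chat
```

If the same long document works when sent this way but fails through the agent, the agent is most likely not passing `num_ctx` through.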

@goactiongo commented on GitHub (Sep 22, 2024):

Here I just use gemma-2-27b as a sample; I tried many models and the issue is the same.

> The 27B model was trained with 13 trillion tokens and the 9B model was trained with 8 trillion tokens.

  "model": "gemma2:27b",
  "name": "localNet-ollama-gemma2:27b",
  "avatar": "/imgs/model/openai.svg",
  "maxContext": 8000,
  "maxResponse": 8000,
  "quoteMaxToken": 5000,

Or, changed to the following, with the same issue:

  "model": "gemma2:27b",
  "name": "localNet-ollama-gemma2:27b",
  "avatar": "/imgs/model/openai.svg",
  "maxContext": 120000,
  "maxResponse": 10000,
  "quoteMaxToken": 5000,
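A rough back-of-the-envelope check of why ~2000-word documents fail while ~1000-word ones work, assuming ~1.3 tokens per English word (a rule of thumb, not exact): the journalctl output below shows the runner started with `--ctx-size 8192 --parallel 4`, and if that 8192-token context is shared across the 4 parallel slots, each request effectively gets about 2048 tokens:

```python
# Rough sketch: per-slot context vs. document size.
# Assumption: ~1.3 tokens per English word (rule of thumb, not exact).
TOKENS_PER_WORD = 1.3

ctx_size = 8192                  # --ctx-size from the log below
parallel = 4                     # --parallel from the log below
per_slot = ctx_size // parallel  # tokens available to one request slot

for words in (1000, 2000):
    tokens = int(words * TOKENS_PER_WORD)
    print(f"{words} words -> ~{tokens} tokens, fits in {per_slot}: {tokens <= per_slot}")
```

A 1000-word document (~1300 tokens) fits in a 2048-token slot; a 2000-word document (~2600 tokens), let alone two of them, does not, so the start of the prompt is silently dropped.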

journalctl -u ollama --no-pager

9月 22 13:35:58 gpu systemd[1]: Stopping Ollama Service...
9月 22 13:36:01 gpu systemd[1]: Stopped Ollama Service.
9月 22 13:36:01 gpu systemd[1]: Started Ollama Service.
9月 22 13:36:01 gpu ollama[50713]: 2024/09/22 13:36:01 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.944+08:00 level=INFO source=images.go:753 msg="total blobs: 34"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.946+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.947+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.950+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2397062001/runners
9月 22 13:36:17 gpu ollama[50713]: time=2024-09-22T13:36:17.632+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
9月 22 13:36:17 gpu ollama[50713]: time=2024-09-22T13:36:17.632+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="2.8 GiB"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="19.7 GiB"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="14.8 GiB"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="6.4 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.572+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 parallel=4 available=21122187264 required="18.8 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.572+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="100.7 GiB" free_swap="3.5 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.573+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 layers.offload=47 layers.split="" memory.available="[19.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.8 GiB" memory.required.partial="18.8 GiB" memory.required.kv="2.9 GiB" memory.required.allocations="[18.8 GiB]" memory.weights.total="16.5 GiB" memory.weights.repeating="15.6 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="562.0 MiB" memory.graph.partial="1.4 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2397062001/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 38200"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.593+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
9月 22 13:36:33 gpu ollama[50713]: INFO [main] build info | build=10 commit="9225b05" tid="140300405710848" timestamp=1726983393
9月 22 13:36:33 gpu ollama[50713]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140300405710848" timestamp=1726983393 total_threads=64
9月 22 13:36:33 gpu ollama[50713]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="38200" tid="140300405710848" timestamp=1726983393
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc (version GGUF V3 (latest))
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 0: general.architecture str = gemma2
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 11: general.file_type u32 = 2
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.846+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type f32: 185 tensors
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type q4_0: 322 tensors
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type q6_K: 1 tensors
9月 22 13:36:34 gpu ollama[50713]: llm_load_vocab: special tokens cache size = 108
9月 22 13:36:34 gpu ollama[50713]: llm_load_vocab: token to piece cache size = 1.6014 MB
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: format = GGUF V3 (latest)
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: arch = gemma2
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: vocab type = SPM
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_vocab = 256000
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_merges = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: vocab_only = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ctx_train = 8192
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd = 4608
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_layer = 46
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_head = 32
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_head_kv = 16
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_rot = 128
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_swa = 4096
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_head_k = 128
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_head_v = 128
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_gqa = 2
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_k_gqa = 2048
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_v_gqa = 2048
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_norm_eps = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_logit_scale = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ff = 36864
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_expert = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_expert_used = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: causal attn = 1
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: pooling type = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope type = 2
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope scaling = linear
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: freq_base_train = 10000.0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: freq_scale_train = 1
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ctx_orig_yarn = 8192
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope_finetuned = unknown
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_conv = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_inner = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_state = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_dt_rank = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_dt_b_c_rms = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model type = 27B
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model ftype = Q4_0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model params = 27.23 B
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW)
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: general.name = gemma-2-27b-it
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: BOS token = 2 ''
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: EOS token = 1 ''
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: UNK token = 3 ''
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: PAD token = 0 ''
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: LF token = 227 '<0x0A>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: EOT token = 107 '<end_of_turn>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: max token length = 93
9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: found 1 CUDA devices:
9月 22 13:36:34 gpu ollama[50713]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
9月 22 13:36:34 gpu ollama[50713]: llm_load_tensors: ggml ctx size = 0.45 MiB
9月 22 13:36:35 gpu ollama[50713]: time=2024-09-22T13:36:35.304+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
9月 22 13:36:36 gpu ollama[50713]: time=2024-09-22T13:36:36.592+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloading 46 repeating layers to GPU
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloading non-repeating layers to GPU
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloaded 47/47 layers to GPU
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: CPU buffer size = 922.85 MiB
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB
9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.309+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_ctx = 8192
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_batch = 512
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_ubatch = 512
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: flash_attn = 0
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: freq_base = 10000.0
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: freq_scale = 1
9月 22 13:36:39 gpu ollama[50713]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: graph nodes = 1850
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: graph splits = 2
9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.562+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:36:39 gpu ollama[50713]: INFO [main] model loaded | tid="140300405710848" timestamp=1726983399
9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.814+08:00 level=INFO source=server.go:626 msg="llama runner started in 6.22 seconds"
9月 22 13:36:42 gpu ollama[50713]: [GIN] 2024/09/22 - 13:36:42 | 200 | 10.06400965s | 172.16.1.219 | POST "/v1/chat/completions"
9月 22 13:36:46 gpu systemd[1]: Stopping Ollama Service...
9月 22 13:36:47 gpu systemd[1]: Stopped Ollama Service.
9月 22 13:36:47 gpu systemd[1]: Started Ollama Service.
9月 22 13:36:47 gpu ollama[50857]: 2024/09/22 13:36:47 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:
https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.610+08:00 level=INFO source=images.go:753 msg="total blobs: 34"
9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.613+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.614+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.616+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2548666145/runners
9月 22 13:37:02 gpu ollama[50857]: time=2024-09-22T13:37:02.902+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2 cuda_v11]"
9月 22 13:37:02 gpu ollama[50857]: time=2024-09-22T13:37:02.902+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="2.8 GiB"
9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="19.7 GiB"
9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="14.8 GiB"
9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="6.4 GiB"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.159+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 parallel=4 available=21122187264 required="18.8 GiB"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.159+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="100.7 GiB" free_swap="3.5 GiB"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.160+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 layers.offload=47 layers.split="" memory.available="[19.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.8 GiB" memory.required.partial="18.8 GiB" memory.required.kv="2.9 GiB" memory.required.allocations="[18.8 GiB]" memory.weights.total="16.5 GiB" memory.weights.repeating="15.6 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="562.0 MiB" memory.graph.partial="1.4 GiB"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.176+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2548666145/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 42032"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.177+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.178+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.178+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
9月 22 13:37:53 gpu ollama[50857]: INFO [main] build info | build=10 commit="9225b05" tid="140713765588992" timestamp=1726983473
9月 22 13:37:53 gpu ollama[50857]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140713765588992" timestamp=1726983473 total_threads=64
9月 22 13:37:53 gpu ollama[50857]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="42032" tid="140713765588992" timestamp=1726983473
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc (version GGUF V3 (latest))
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 0: general.architecture str = gemma2
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 11: general.file_type u32 = 2
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.431+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ...
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - type f32: 185 tensors
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - type q4_0: 322 tensors
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - type q6_K: 1 tensors
9月 22 13:37:53 gpu ollama[50857]: llm_load_vocab: special tokens cache size = 108
9月 22 13:37:53 gpu ollama[50857]: llm_load_vocab: token to piece cache size = 1.6014 MB
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: format = GGUF V3 (latest)
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: arch = gemma2
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: vocab type = SPM
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_vocab = 256000
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_merges = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: vocab_only = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_ctx_train = 8192
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd = 4608
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_layer = 46
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_head = 32
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_head_kv = 16
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_rot = 128
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_swa = 4096
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_head_k = 128
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_head_v = 128
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_gqa = 2
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_k_gqa = 2048
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_v_gqa = 2048
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_norm_eps = 0.0e+00
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_logit_scale = 0.0e+00
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_ff = 36864
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_expert = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_expert_used = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: causal attn = 1
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: pooling type = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: rope type = 2
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: rope scaling = linear
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: freq_base_train = 10000.0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: freq_scale_train = 1
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_ctx_orig_yarn = 8192
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: rope_finetuned = unknown
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_d_conv = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_d_inner = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_d_state = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_dt_rank = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_dt_b_c_rms = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model type = 27B
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model ftype = Q4_0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model params = 27.23 B
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW)
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: general.name = gemma-2-27b-it
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: BOS token = 2 ''
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: EOS token = 1 ''
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: UNK token = 3 ''
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: PAD token = 0 ''
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: LF token = 227 '<0x0A>'
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: EOT token = 107 '<end_of_turn>'
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: max token length = 93
9月 22 13:37:53 gpu ollama[50857]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
9月 22 13:37:53 gpu ollama[50857]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
9月 22 13:37:53 gpu ollama[50857]: ggml_cuda_init: found 1 CUDA devices:
9月 22 13:37:53 gpu ollama[50857]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
9月 22 13:37:54 gpu ollama[50857]: llm_load_tensors: ggml ctx size = 0.45 MiB
9月 22 13:37:54 gpu ollama[50857]: time=2024-09-22T13:37:54.888+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: offloading 46 repeating layers to GPU
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: offloading non-repeating layers to GPU
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: offloaded 47/47 layers to GPU
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: CPU buffer size = 922.85 MiB
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB
9月 22 13:37:56 gpu ollama[50857]: time=2024-09-22T13:37:56.042+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: n_ctx = 8192
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: n_batch = 512
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: n_ubatch = 512
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: flash_attn = 0
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: freq_base = 10000.0
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: freq_scale = 1
9月 22 13:37:58 gpu ollama[50857]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: graph nodes = 1850
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: graph splits = 2
9月 22 13:37:58 gpu ollama[50857]: INFO [main] model loaded | tid="140713765588992" timestamp=1726983478
9月 22 13:37:58 gpu ollama[50857]: time=2024-09-22T13:37:58.556+08:00 level=INFO source=server.go:626 msg="llama runner started in 5.38 seconds"
9月 22 13:38:00 gpu ollama[50857]: [GIN] 2024/09/22 - 13:38:00 | 200 | 9.160086456s | 172.16.1.219 | POST "/v1/chat/completions"
(base) [root@gpu ~]#

Here is the AI debug information (the full prompt sent to the model and the model's response):

System
answer the question。
The content within "" is to be considered as your knowledge

File: test.docx

…may be resold to the company, in whole or in part, at face value. (中国国际贸易中心股份有限公司, 2024 semi-annual report)

| | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- |
| Current income tax calculated under tax law and related regulations | 228,742,593 | 213,975,457 |
| Deferred income tax | 397,078 | 1,899,031 |
| Total | 229,139,671 | 215,874,488 |

Reconciliation of income tax, calculated by applying the applicable tax rate to total profit per the consolidated income statement, to income tax expense:

| | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- |
| Total profit | 917,100,107 | 869,897,127 |
| Income tax at the applicable tax rate | 229,275,027 | 217,474,282 |
| Adjustment for income tax on non-taxable income | (294,273) | (441,994) |
| Adjustment for income tax on non-deductible costs, expenses and losses | 112,836 | 54,741 |
| Effect of tax-rate differences | 9,216 | 281,022 |
| Deductible losses for which no deferred tax asset was recognized in the period | 36,865 | 1,124,089 |
| Other | — | (2,617,652) |
| Income tax expense | 229,139,671 | 215,874,488 |

中国国际贸易中心股份有限公司 — Notes to the financial statements for the six months ended 30 June 2024 (amounts in RMB unless otherwise stated)

Notes to the consolidated financial statements (continued)

41 Earnings per share

(1) Basic earnings per share

Basic earnings per share is calculated as consolidated net profit attributable to ordinary shareholders of the parent, divided by the weighted average number of the parent's ordinary shares outstanding:

| | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- |
| Consolidated net profit attributable to ordinary shareholders of the parent | 687,537,223 | 653,651,170 |
| Weighted average number of ordinary shares outstanding | 1,007,282,534 | 1,007,282,534 |
| Basic earnings per share | 0.68 | 0.65 |
| of which: basic EPS from continuing operations | 0.68 | 0.65 |
| basic EPS from discontinued operations | — | — |

(2) Diluted earnings per share

Diluted earnings per share is calculated as consolidated net profit attributable to ordinary shareholders of the parent, adjusted for dilutive potential ordinary shares, divided by the adjusted weighted average number of ordinary shares outstanding. In Jan–Jun 2024 the company had no dilutive potential ordinary shares (Jan–Jun 2023: none), so diluted EPS equals basic EPS.

42 Notes to cash flow statement items

(1) Other cash received relating to operating activities

| | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- |
| Interest income | 21,033,279 | 5,074,023 |
| Insurance claim income | 16,272,954 | — |
| Tenant default penalty income | 1,282,587 | 7,411,061 |
| Government grants | 469,524 | 894,749 |
| Lease deposits (i) | — | 969,319 |
| Other | 6,033,470 | 6,548,135 |
| Total | 45,091,814 | 20,897,287 |

(i) In Jan–Jun 2024 the Group received lease deposits of RMB 25,452,903 and paid lease deposits of RMB 26,496,022, a net payment of RMB 1,043,119 (Jan–Jun 2023: received RMB 20,732,626, paid RMB 19,763,307, a net receipt of RMB 969,319).

(2) Other cash paid relating to operating activities

| | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- |
| Utilities and heating | 56,112,195 | 51,101,506 |
| Advertising and promotion | 22,754,073 | 40,494,713 |
| Insurance premiums | 5,770,956 | 5,536,730 |
| Rent | 1,252,681 | 1,301,188 |
| Lease deposits (Note IV.42(1)(i)) | 1,043,119 | — |
| Other | 27,348,375 | 22,728,332 |
| Total | 114,281,399 | 121,162,469 |

(3) Other cash paid relating to financing activities

| | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- |
| Payments on lease liabilities | 1,203,884 | 1,203,884 |

In Jan–Jun 2024 the Group's total cash outflow for leases as lessee was RMB 2,456,565 (Jan–Jun 2023: RMB 2,505,072); apart from payments on lease liabilities, which are classified as financing activities, all such outflows are classified as operating activities.

Notes to the consolidated financial statements (continued)

43 Supplementary cash flow information

(1) Reconciliation of net profit to cash flow from operating activities

| | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- |
| Net profit | 687,960,436 | 654,022,639 |
| Add: depreciation of investment property (Note IV.8) | 170,454,149 | 176,373,646 |
| Depreciation of fixed assets (Note IV.9) | 47,846,841 | 48,825,698 |
| Amortization of intangible assets (Note IV.11) | 7,729,457 | 7,729,457 |
| Amortization of long-term deferred expenses (Note IV.12) | 3,649,944 | 3,677,472 |
| Depreciation of right-of-use assets (Note IV.10) | 1,801,654 | 1,801,654 |
| Net loss/(gain) on disposal of non-current assets (Notes IV.38, 39) | 815,142 | 1,438,618 |
| Finance expenses/(income) (Note IV.33) | 28,825,210 | 40,904,785 |
| Investment (income)/losses (Note IV.37) | (1,177,093) | (1,767,975) |
| Decrease/(increase) in deferred tax assets (Note IV.13) | 397,078 | 1,899,031 |
| Decrease/(increase) in inventories (Note IV.5) | 329,316 | 1,357,482 |
| Credit impairment losses (Note IV.35) | 28,332 | (149,374) |
| Decrease/(increase) in restricted funds | 13,771,165 | 20,165,655 |
| Decrease/(increase) in operating receivables | 20,904,271 | 75,426,863 |
| (Decrease)/increase in operating payables | (40,912,808) | (39,068,402) |
| Net cash flow from operating activities | 942,423,094 | 992,637,249 |

(2) Net change in cash and cash equivalents

| | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- |
| Cash at end of period | 3,425,888,471 | 3,550,484,636 |
| Less: cash at beginning of period | (3,890,169,116) | (3,326,552,300) |
| Add: cash equivalents at end of period | — | — |
| Less: cash equivalents at beginning of period | — | — |
| Net (decrease)/increase in cash and cash equivalents | (464,280,645) | 223,932,336 |

Notes to the consolidated financial statements (continued)

43 Supplementary cash flow information (continued)

(3) Changes in liabilities arising from financing activities

| | Long-term borrowings, incl. current portion (Note IV.23) | Bonds payable, incl. current portion (Note IV.22) | Lease liabilities, incl. current portion (Note IV.24) | Dividends payable | Total |
| --- | --- | --- | --- | --- | --- |
| 31 Dec 2023 | 1,136,240,417 | 443,190,003 | 40,903,391 | — | 1,620,333,811 |
| Interest/dividends accrued in the period | 21,659,083 | 6,379,998 | 787,784 | 1,309,467,294 | 1,338,294,159 |
| Cash outflows from financing activities | (71,724,084) | — | (1,203,884) | (1,309,467,294) | (1,382,395,262) |
| of which: principal repaid | (50,000,000) | — | — | — | (50,000,000) |
| rent paid | — | — | (1,203,884) | — | (1,203,884) |
| interest repaid | (21,724,084) | — | — | — | (21,724,084) |
| dividends paid | — | — | — | (1,309,467,294) | (1,309,467,294) |
| 30 Jun 2024 | 1,086,175,416 | 449,570,001 | 40,487,291 | — | 1,576,232,708 |

(4) Cash and cash equivalents

| | 30 Jun 2024 | 31 Dec 2023 |
| --- | --- | --- |
| Cash at bank and on hand at end of period/year (Note IV.1) | 3,612,151,002 | 4,088,660,385 |
| of which: cash on hand | 771,466 | 885,944 |
| bank deposits | 3,555,200,837 | 4,033,138,169 |
| interest receivable | 56,178,699 | 54,636,272 |
| Less: restricted monetary funds (Note IV.1) | 130,083,832 | 143,854,997 |
| interest receivable | 56,178,699 | 54,636,272 |
| Cash and cash equivalents at end of period/year | 3,425,888,471 | 3,890,169,116 |

44 Monetary items denominated in foreign currencies

As at 30 Jun 2024:

| | Foreign currency balance | Exchange rate | RMB balance |
| --- | --- | --- | --- |
| Cash at bank and on hand — USD | 723,230 | 7.1268 | 5,154,316 |
| EUR | 322 | 7.6617 | 2,467 |
| Other payables — USD | 802,437 | 7.1268 | 5,718,808 |
| HKD | 220,000 | 0.9127 | 200,794 |
| EUR | 36,580 | 7.6617 | 280,265 |
| GBP | 29,762 | 9.0430 | 269,138 |

The foreign-currency monetary items above cover all currencies other than RMB.

Interests in other entities

1 Interests in subsidiaries

(1) Composition of the group

| Subsidiary | Principal place of business | Place of registration | Registered capital | Nature of business | Direct shareholding | Indirect shareholding | Method of acquisition |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 国贸物业酒店管理有限公司 | Beijing | Beijing | RMB 30,000,000 | Services | 95% | — | Directly held |
| 北京国贸国际会展有限公司 | Beijing | Beijing | RMB 10,000,000 | Services | — | 95% | Indirectly held |

2 Interests in associates

(1) Summary information on individually immaterial associates

| Associate | | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- | --- |
| 时代网星 | Total carrying amount of investment | 20,891,146 | 24,538,713 |
| | Net profit per shareholding | 725,751 | 1,549,010 |
| 力创智慧 | Total carrying amount of investment | 874,791 | 768,203 |
| | Net profit per shareholding | 109,050 | (7,277) |
| 北京昌发展 | Total carrying amount of investment | 2,953,266 | 2,121,392 |
| | Net profit per shareholding | 342,292 | 226,242 |
| 首程国贸 | Total carrying amount of investment | 2,000,000 | — |
| | Net profit per shareholding | — | — |
| Total | Carrying amount of investments | 26,719,203 | 27,428,308 |
| | Net profit per shareholding | 1,177,093 | 1,767,975 |

Segment information

The Group's reportable segments are business units providing different services. Because each business requires different technologies and market strategies, the Group manages the operations of each reportable segment independently and evaluates its results separately, in order to allocate resources and assess performance.

The Group has two reportable segments:

- Leasing and property management, which provides property leasing, property management, and exhibition services
- Hotel operations, which provides rooms, food and beverage, and related services

Inter-segment transfer prices are set by reference to prices charged to third parties. Assets and liabilities are allocated by segment; expenses indirectly attributable to segments are allocated between them in proportion to the benefits received.

(1) Segment information for Jan–Jun 2024 and as at 30 Jun 2024:

| | Leasing & property management | Hotel operations | Unallocated | Inter-segment elimination | Total |
| --- | --- | --- | --- | --- | --- |
| External revenue | 1,709,030,499 | 256,263,022 | — | — | 1,965,293,521 |
| Inter-segment revenue | 2,059,878 | 4,187,445 | — | (6,247,323) | — |
| Cost of principal operations | (533,033,970) | (235,769,751) | — | — | (768,803,721) |
| Interest income | 22,361,607 | 214,099 | — | — | 22,575,706 |
| Interest expense | — | — | (28,826,865) | — | (28,826,865) |
| Investment income from associates | — | — | 1,177,093 | — | 1,177,093 |
| Depreciation and amortization | (179,723,132) | (51,758,913) | — | — | (231,482,045) |
| Total profit | 945,909,574 | (1,159,695) | (27,649,772) | — | 917,100,107 |
| Income tax expense | — | — | (229,139,671) | — | (229,139,671) |
| Net profit | 945,909,574 | (1,159,695) | (256,789,443) | — | 687,960,436 |
| Total assets | 10,197,061,326 | 1,877,256,344 | 93,038,252 | — | 12,167,355,922 |
| Total liabilities | 1,300,281,045 | 95,925,031 | 1,688,672,716 | — | 3,084,878,792 |
| Long-term equity investments in associates | — | — | 26,719,203 | — | 26,719,203 |
| Additions to non-current assets (i) | 20,311,781 | 3,092,259 | — | — | 23,404,040 |

(i) Non-current assets exclude long-term equity investments and deferred tax assets.

Segment information (continued)

(2) Segment information for Jan–Jun 2023 and as at 30 Jun 2023:

| | Leasing & property management | Hotel operations | Unallocated | Inter-segment elimination | Total |
| --- | --- | --- | --- | --- | --- |
| External revenue | 1,666,402,278 | 271,961,862 | — | — | 1,938,364,140 |
| Inter-segment revenue | 1,842,491 | 3,411,616 | — | (5,254,107) | — |
| Cost of principal operations | (518,526,122) | (251,494,757) | — | — | (770,020,879) |
| Interest income | 16,829,724 | 244,299 | — | — | 17,074,023 |
| Interest expense | — | — | (40,499,642) | — | (40,499,642) |
| Investment income from associates | — | — | 1,767,975 | — | 1,767,975 |
| Depreciation of right-of-use assets | (1,801,654) | — | — | — | (1,801,654) |
| Depreciation and amortization | (183,871,459) | (52,734,814) | — | — | (236,606,273) |
| Total profit/(loss) | 904,979,046 | 3,649,748 | (38,731,667) | — | 869,897,127 |
| Income tax expense | — | — | (215,874,488) | — | (215,874,488) |
| Net profit | 904,979,046 | 3,649,748 | (254,606,155) | — | 654,022,639 |
| Total assets | 10,680,671,485 | 1,962,000,318 | 94,542,239 | — | 12,737,214,042 |
| Total liabilities | (1,267,927,182) | (102,496,997) | (2,268,886,633) | — | (3,639,310,812) |
| Long-term equity investments in associates | — | — | 27,428,308 | — | 27,428,308 |
| Additions to non-current assets (i) | 23,366,594 | 1,525,911 | — | — | 24,892,505 |

(i) Non-current assets exclude long-term equity investments and deferred tax assets.

Related party relationships and transactions

1 Parent company

(1) Basic information

| | Place of registration | Nature of business |
| --- | --- | --- |
| 国贸有限公司 | Beijing | Services |

The ultimate controlling party of the company is 国贸有限公司.

(2) Registered capital of the parent and changes therein

| | 31 Dec 2023 | Increase in period | Decrease in period | 30 Jun 2024 |
| --- | --- | --- | --- | --- |
| 国贸有限公司 | USD 240,000,000 | — | — | USD 240,000,000 |

(3) Parent's shareholding and voting rights in the company

| | 30 Jun 2024 shareholding | 30 Jun 2024 voting rights | 31 Dec 2023 shareholding | 31 Dec 2023 voting rights |
| --- | --- | --- | --- | --- |
| 国贸有限公司 | 80.65% | 80.65% | 80.65% | 80.65% |

2 Subsidiaries

Basic information on subsidiaries is given in Note V.1.

3 Associates

Basic information on associates is given in Note V.2.

4 Other related parties

| | Relationship with the Group |
| --- | --- |
| 中国世贸投资有限公司 | Chinese investor in the parent |
| 嘉里兴业有限公司 | Foreign investor in the parent |
| 香格里拉国际饭店管理有限公司 ("香格里拉") | Related company of the parent's foreign investor |
| 香格里拉饭店管理(上海)有限公司北京分公司 ("香格里拉北京") | Related company of the parent's foreign investor |

Related party relationships and transactions (continued)

5 Related party transactions

(1) Purchases and sales of goods, provision and receipt of services

Prices for transactions with related parties are negotiated between the parties on the basis of market prices, and the relevant contracts are signed after approval by the company's board of directors or shareholders' meeting. The board considers all related party transactions to be on normal commercial terms.

Services received:

| Related party | Transaction type | Transaction content | Jan–Jun 2024 | Jan–Jun 2023 |
| --- | --- | --- | --- | --- |
| 国贸有限公司 | Services received | Catering and membership fees | 1,199,036 | 906,827 |
| 国贸有限公司 | Services received | Labor service fees | 3,222,808 | 3,317,866 |
| 国贸有限公司 | Services received | Public relations management, union and administrative fees, etc. | 4,001,239 | 2,751,966 |
| 国贸有限公司 | Services received | Hotel operating supplies, food processing, laundry, etc. | 2,518,468 | 1,854,990 |

**Human:** summary the document

**AI:** Please provide me with the document you would like me to summarize. I need the actual text of the document in order to analyze it and create a summary for you.

For example, you can paste the text directly into our chat or provide a link to the document if it's publicly accessible online.
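The refusal above is consistent with the quoted document never reaching the model: once the prompt is trimmed to fit the runner's context window, the `<Quote>` block is dropped and the model genuinely has nothing to summarize. As a hedged sketch (not the reporter's actual setup): Ollama's native `/api/chat` endpoint accepts a per-request `options.num_ctx` context override, while the OpenAI-compatible `/v1/chat/completions` route seen in the logs has no equivalent field, so a front end's `maxContext` setting never reaches the runner. The URL and message text here are illustrative placeholders.

```python
import json

# Sketch of a request to Ollama's native chat endpoint with an explicit
# context-window override. "num_ctx" is a per-request runner option; the
# OpenAI-compatible /v1/chat/completions route has no equivalent, so a
# front end's own "maxContext" setting never reaches the runner.
# The URL and message below are placeholders, not the reporter's setup.
payload = {
    "model": "gemma2:27b",
    "messages": [
        {"role": "user", "content": "Summarize the quoted document."},
    ],
    "options": {"num_ctx": 8192},  # ask the runner for an 8192-token context
    "stream": False,
}

body = json.dumps(payload)
# To actually send it (requires a running server, so not executed here):
#   requests.post("http://localhost:11434/api/chat", data=body)
print(json.loads(body)["options"]["num_ctx"])  # → 8192
```

Note that gemma2's trained context is 8192 tokens (`n_ctx_train = 8192` in the logs), so raising `num_ctx` beyond that does not help for this particular model.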

<!-- gh-comment-id:2365478440 --> @goactiongo commented on GitHub (Sep 22, 2024):

Here I just used gemma2:27b as a sample; I tried many models and the issue is the same. (The 27B model was trained on 13 trillion tokens and the 9B model on 8 trillion tokens.)

```json
{
  "model": "gemma2:27b",
  "name": "localNet-ollama-gemma2:27b",
  "avatar": "/imgs/model/openai.svg",
  "maxContext": 8000,
  "maxResponse": 8000,
  "quoteMaxToken": 5000
}
```

Changing it to the following gives the same issue:

```json
{
  "model": "gemma2:27b",
  "name": "localNet-ollama-gemma2:27b",
  "avatar": "/imgs/model/openai.svg",
  "maxContext": 120000,
  "maxResponse": 10000,
  "quoteMaxToken": 5000
}
```

## journalctl -u ollama --no-pager

```
9月 22 13:35:58 gpu systemd[1]: Stopping Ollama Service...
9月 22 13:36:01 gpu systemd[1]: Stopped Ollama Service.
9月 22 13:36:01 gpu systemd[1]: Started Ollama Service.
9月 22 13:36:01 gpu ollama[50713]: 2024/09/22 13:36:01 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.944+08:00 level=INFO source=images.go:753 msg="total blobs: 34"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.946+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.947+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.950+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2397062001/runners
9月 22 13:36:17 gpu ollama[50713]: time=2024-09-22T13:36:17.632+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
9月 22 13:36:17 gpu ollama[50713]: time=2024-09-22T13:36:17.632+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="2.8 GiB"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="19.7 GiB"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="14.8 GiB"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="6.4 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.572+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 parallel=4 available=21122187264 required="18.8 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.572+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="100.7 GiB" free_swap="3.5 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.573+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 layers.offload=47 layers.split="" memory.available="[19.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.8 GiB" memory.required.partial="18.8 GiB" memory.required.kv="2.9 GiB" memory.required.allocations="[18.8 GiB]" memory.weights.total="16.5 GiB" memory.weights.repeating="15.6 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="562.0 MiB" memory.graph.partial="1.4 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2397062001/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 38200"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.593+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
9月 22 13:36:33 gpu ollama[50713]: INFO [main] build info | build=10 commit="9225b05" tid="140300405710848" timestamp=1726983393
9月 22 13:36:33 gpu ollama[50713]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140300405710848" timestamp=1726983393 total_threads=64
9月 22 13:36:33 gpu ollama[50713]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="38200" tid="140300405710848" timestamp=1726983393
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc (version GGUF V3 (latest))
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 0: general.architecture str = gemma2
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 11: general.file_type u32 = 2
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.846+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type f32: 185 tensors
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type q4_0: 322 tensors
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type q6_K: 1 tensors
9月 22 13:36:34 gpu ollama[50713]: llm_load_vocab: special tokens cache size = 108
9月 22 13:36:34 gpu ollama[50713]: llm_load_vocab: token to piece cache size = 1.6014 MB
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: format = GGUF V3 (latest)
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: arch = gemma2
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: vocab type = SPM
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_vocab = 256000
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_merges = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: vocab_only = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ctx_train = 8192
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd = 4608
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_layer = 46
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_head = 32
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_head_kv = 16
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_rot = 128
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_swa = 4096
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_head_k = 128
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_head_v = 128
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_gqa = 2
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_k_gqa = 2048
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_v_gqa = 2048
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_norm_eps = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_logit_scale = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ff = 36864
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_expert = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_expert_used = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: causal attn = 1
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: pooling type = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope type = 2
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope scaling = linear
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: freq_base_train = 10000.0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: freq_scale_train = 1
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ctx_orig_yarn = 8192
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope_finetuned = unknown
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_conv = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_inner = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_state = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_dt_rank = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_dt_b_c_rms = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model type = 27B
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model ftype = Q4_0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model params = 27.23 B
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW)
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: general.name = gemma-2-27b-it
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: BOS token = 2 '<bos>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: EOS token = 1 '<eos>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: UNK token = 3 '<unk>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: PAD token = 0 '<pad>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: LF token = 227 '<0x0A>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: EOT token = 107 '<end_of_turn>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: max token length = 93
9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: found 1 CUDA devices:
9月 22 13:36:34 gpu ollama[50713]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
9月 22 13:36:34 gpu ollama[50713]: llm_load_tensors: ggml ctx size = 0.45 MiB
9月 22 13:36:35 gpu ollama[50713]: time=2024-09-22T13:36:35.304+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
9月 22 13:36:36 gpu ollama[50713]: time=2024-09-22T13:36:36.592+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloading 46 repeating layers to GPU
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloading non-repeating layers to GPU
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloaded 47/47 layers to GPU
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: CPU buffer size = 922.85 MiB
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB
9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.309+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_ctx = 8192
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_batch = 512
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_ubatch = 512
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: flash_attn = 0
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: freq_base = 10000.0
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: freq_scale = 1
9月 22 13:36:39 gpu ollama[50713]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: graph nodes = 1850
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: graph splits = 2
9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.562+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:36:39 gpu ollama[50713]: INFO [main] model loaded | tid="140300405710848" timestamp=1726983399
9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.814+08:00 level=INFO source=server.go:626 msg="llama runner started in 6.22 seconds"
9月 22 13:36:42 gpu ollama[50713]: [GIN] 2024/09/22 - 13:36:42 | 200 | 10.06400965s | 172.16.1.219 | POST "/v1/chat/completions"
9月 22 13:36:46 gpu systemd[1]: Stopping Ollama Service...
9月 22 13:36:47 gpu systemd[1]: Stopped Ollama Service.
9月 22 13:36:47 gpu systemd[1]: Started Ollama Service.
9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.614+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.159+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 parallel=4 available=21122187264 required="18.8 GiB"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.176+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2548666145/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 42032"
[... the second start (pid 50857, port 42032) then prints the same GPU detection, metadata dump and context setup as the first start; duplicate lines omitted ...]
9月 22 13:37:58 gpu ollama[50857]: INFO [main] model loaded | tid="140713765588992" timestamp=1726983478
9月 22 13:37:58 gpu ollama[50857]: time=2024-09-22T13:37:58.556+08:00 level=INFO source=server.go:626 msg="llama runner started in 5.38 seconds"
9月 22 13:38:00 gpu ollama[50857]: [GIN] 2024/09/22 - 13:38:00 | 200 | 9.160086456s | 172.16.1.219 | POST "/v1/chat/completions"
```

## Here is the AI DEBUG information

The prompt sent to the model (verbatim payload):

```
System
answer the question。The content within "<Quote></Quote>" is to be considered as your knowledge
<Quote>
File: test.docx
<Content>
4 按面值全部或部分回售给公司。中国国际贸易中心股份有限公司 2024 年半年度报告
按税法及相关规定计算的当期所得税 228,742,593 213,975,457
递延所得税 397,078 1,899,031
合计 229,139,671 215,874,488
将基于合并利润表的利润总额采用适用税率计算的所得税调节为所得税费用:
2024 年 1 至 6 月 2023 年 1 至 6 月
利润总额 917,100,107 869,897,127
按适用税率计算的所得税 229,275,027 217,474,282
非应纳税收入涉及的所得税费用调整额 (294,273) (441,994)
不得扣除的成本、费用和损失涉及的所得税费用调整额 112,836 54,741
税率差异的影响 9,216 281,022
当期未确认递延所得税资产的可抵扣亏损 36,865 1,124,089
其他 - (2,617,652)
所得税费用 229,139,671 215,874,488
中国国际贸易中心股份有限公司 财务报表附注
截至 2024 年 6 月 30 日 6 个月期间
(除特别注明外,金额单位为人民币元)
- 55 -
四 合并财务报表项目附注(续)
41 每股收益
(1) 基本每股收益
基本每股收益以归属于母公司普通股股东的合并净利润除以母公司发行在外普通股的加权平均数计算:
2024 年 1 至 6 月 2023 年 1 至 6 月
归属于母公司普通股股东的合并净利润
```
687,537,223 653,651,170 本公司发行在外普通股的 加权平均数 1,007,282,534 1,007,282,534 基本每股收益 0.68 0.65 其中: — 持续经营基本每股收益: 0.68 0.65 — 终止经营基本每股收益: - - (2) 稀释每股收益 稀释每股收益以根据稀释性潜在普通股调整后的归属于母公司普通股股东的合并 净利润除以调整后的本公司发行在外普通股的加权平均数计算。2024 年 1 至 6 月,本公司不存在具有稀释性的潜在普通股(2023 年 1 至 6 月:不存在),因此, 稀释每股收益等于基本每股收益。 42 现金流量表项目注释 (1) 收到的其他与经营活动有关的现金 2024 年 1 至 6 月 2023 年 1 至 6 月 利息收入 21,033,279 5,074,023 保险理赔收入 16,272,954 - 租户违约罚款收入 1,282,587 7,411,061 政府补助 469,524 894,749 租赁押金(i) - 969,319 其他 6,033,470 6,548,135 合计 45,091,814 20,897,287 中国国际贸易中心股份有限公司 财务报表附注 截至 2024 年 6 月 30 日 6 个月期间 (除特别注明外,金额单位为人民币元) - 56 - 四 合并财务报表项目附注(续) 42 现金流量表项目注释(续) (1) 收到的其他与经营活动有关的现金(续) (i) 2024 年 1 至 6 月,本集团实际收到租赁押金 25,452,903 元,支付租赁押金 26,496,022 元,净支付租赁押金 1,043,119 元(2023 年 1 至 6 月:收到租赁押金 20,732,626 元,支付租赁押金 19,763,307 元,净收到租赁押金 969,319 元)。 (2) 支付的其他与经营活动有关的现金 2024 年 1 至 6 月 2023 年 1 至 6 月 水电采暖费 56,112,195 51,101,506 广告宣传费 22,754,073 40,494,713 保险费 5,770,956 5,536,730 租金 1,252,681 1,301,188 租赁押金(附注四.42(1)(i)) 1,043,119 - 其他 27,348,375 22,728,332 合计 114,281,399 121,162,469 (3) 支付的其他与筹资活动有关的现金 2024 年 1 至 6 月 2023 年 1 至 6 月 偿还租赁负债支付的金额 1,203,884 1,203,884 2024 年 1 至 6 月,本集团支付的与作为承租人租赁相关的总现金流出为 2,456,565 元(2023 年 1 至 6 月:2,505,072 元),除计入筹资活动的偿付租赁负 债支付的金额以外,其余现金流出均计入经营活动。中国国际贸易中心股份有限公司 财务报表附注 截至 2024 年 6 月 30 日 6 个月期间 (除特别注明外,金额单位为人民币元) - 57 - 四 合并财务报表项目附注(续) 43 现金流量表补充资料 (1) 将净利润调节为经营活动现金流量 2024 年 1 至 6 月 2023 年 1 至 6 月 净利润 687,960,436 654,022,639 加:投资性房地产折旧(附注四.8) 170,454,149 176,373,646 固定资产折旧(附注四.9) 47,846,841 48,825,698 无形资产摊销(附注四.11) 7,729,457 7,729,457 长期待摊费用摊销(附注四.12) 3,649,944 3,677,472 使用权资产折旧(附注四.10) 1,801,654 1,801,654 处置非流动资产净损失/(收益) (附注四.38、39) 815,142 1,438,618 财务费用/(收入)(附注四.33) 28,825,210 40,904,785 投资(收益)/损失(附注四.37) (1,177,093) (1,767,975) 递延所得税资产减少/(增加) (附注四.13) 397,078 1,899,031 存货的减少/(增加)(附注四.5) 329,316 1,357,482 信用减值损失(附注四.35) 28,332 (149,374) 受限资金的减少/(增加) 13,771,165 20,165,655 经营性应收项目的减少/(增加) 20,904,271 75,426,863 经营性应付项目的(减少)/增加 (40,912,808) (39,068,402) 经营活动产生的现金流量净额 942,423,094 992,637,249 (2) 
现金及现金等价物净变动情况 2024 年 1 至 6 月 2023 年 1 至 6 月 现金的期末余额 3,425,888,471 3,550,484,636 减:现金的期初余额 (3,890,169,116) (3,326,552,300) 加:现金等价物的期末余额 - - 减:现金等价物的期初余额 - - 现金及现金等价物的净(减少)/增加额 (464,280,645) 223,932,336中国国际贸易中心股份有限公司 财务报表附注 截至 2024 年 6 月 30 日 6 个月期间 (除特别注明外,金额单位为人民币元) - 58 - 四 合并财务报表项目附注(续) 43 现金流量表补充资料(续) (3) 筹资活动产生的各项负债的变动情况 长期借款 (含一年内到期) (附注四.23) 应付债券 (含一年内到期) (附注四.22) 租赁负债 (含一年内到期) (附注四.24) 应付股利 合计 2023 年 12 月 31 日 1,136,240,417 443,190,003 40,903,391 - 1,620,333,811 本期计提的利息/ 股利 21,659,083 6,379,998 787,784 1,309,467,294 1,338,294,159 筹资活动产生的现 金流出 (71,724,084) - (1,203,884) (1,309,467,294) (1,382,395,262) 其中:偿还本金 (50,000,000) - - - (50,000,000) 支付租金 - - (1,203,884) - (1,203,884) 偿还利息 (21,724,084) - - - (21,724,084) 支付股利 - - - (1,309,467,294) (1,309,467,294) 2024 年 6 月 30 日 1,086,175,416 449,570,001 40,487,291 - 1,576,232,708 (4) 现金及现金等价物 2024 年 6 月 30 日 2023 年 12 月 31 日 期/年末货币资金余额(附注四.1) 3,612,151,002 4,088,660,385 其中:库存现金 771,466 885,944 银行存款 3,555,200,837 4,033,138,169 应收利息 56,178,699 54,636,272 减:受到限制的货币资金(附注四.1) 130,083,832 143,854,997 应收利息 56,178,699 54,636,272 期/年末现金及现金等价物余额 3,425,888,471 3,890,169,116 44 外币货币性项目 2024 年 6 月 30 日 外币余额 折算汇率 人民币余额 货币资金— 美元 723,230 7.1268 5,154,316 欧元 322 7.6617 2,467 其他应付款— 美元 802,437 7.1268 5,718,808 港币 220,000 0.9127 200,794 欧元 36,580 7.6617 280,265 英镑 29,762 9.0430 269,138 上述外币货币性项目指除人民币之外的所有货币。中国国际贸易中心股份有限公司 财务报表附注 截至 2024 年 6 月 30 日 6 个月期间 (除特别注明外,金额单位为人民币元) - 59 - 五 在其他主体中的权益 1 在子公司中的权益 (1) 企业集团的构成 子公司名称 主要经营地 注册地 注册资本 业务性质 持股比例 取得方式 直接 间接 国贸物业酒店 管理有限公司 北京 北京 人民币 3000 万 服务业 95% - 直接持有 北京国贸国际 会展有限公司 北京 北京 人民币 1000 万 服务业 - 95% 间接持有 2 在联营企业中的权益 (1) 不重要的联营企业的汇总信息 2024 年 1 至 6 月 2023 年 1 至 6 月 联营企业 时代网星 投资账面价值合计 20,891,146 24,538,713 按持股比例计算的净利润 725,751 1,549,010 力创智慧 投资账面价值合计 874,791 768,203 按持股比例计算的净利润 109,050 (7,277) 北京昌发展 投资账面价值合计 2,953,266 2,121,392 按持股比例计算的净利润 342,292 226,242 首程国贸 投资账面价值合计 2,000,000 - 按持股比例计算的净利润 - - 合计 投资账面价值合计 26,719,203 27,428,308 按持股比例计算的净利润 1,177,093 1,767,975中国国际贸易中心股份有限公司 财务报表附注 截至 2024 年 6 月 30 
日 6 个月期间 (除特别注明外,金额单位为人民币元) - 60 - 六 分部信息 本集团的报告分部是提供不同服务的业务单元。由于各种业务需要不同的技术和 市场战略,因此,本集团分别独立管理各个报告分部的生产经营活动,分别评价 其经营成果,以决定向其配置资源并评价其业绩。 本集团有 2 个报告分部,分别为: - 租赁及物业管理分部,负责提供物业出租、物业管理服务及会展服务 - 酒店经营分部,负责提供客房、餐饮等服务 分部间转移价格参照向第三方销售所采用的价格确定。资产及负债按照分部进行 分配,间接归属于各分部的费用按照受益比例在分部之间进行分配。 (1) 2024 年 1 至 6 月及 2024 年 6 月 30 日分部信息列示如下: 租赁及物业 管理业务 酒店经营 未分配的 金额 分部间的 抵销 合计 对外交易收入 1,709,030,499 256,263,022 - - 1,965,293,521 分部间交易收入 2,059,878 4,187,445 - (6,247,323) - 主营业务成本 (533,033,970) (235,769,751) - - (768,803,721) 利息收入 22,361,607 214,099 - - 22,575,706 利息费用 - - (28,826,865) - (28,826,865) 对联营企业的 投资收益 - - 1,177,093 - 1,177,093 折旧费和摊销费 (179,723,132) (51,758,913) - - (231,482,045) 利润总额 945,909,574 (1,159,695) (27,649,772) - 917,100,107 所得税费用 - - (229,139,671) - (229,139,671) 净利润 945,909,574 (1,159,695) (256,789,443) - 687,960,436 资产总额 10,197,061,326 1,877,256,344 93,038,252 - 12,167,355,922 负债总额 1,300,281,045 95,925,031 1,688,672,716 - 3,084,878,792 对联营企业的长期 股权投资 - - 26,719,203 - 26,719,203 非流动资产增加额(i) 20,311,781 3,092,259 - - 23,404,040 (i) 非流动资产不包括长期股权投资和递延所得税资产。中国国际贸易中心股份有限公司 财务报表附注 截至 2024 年 6 月 30 日 6 个月期间 (除特别注明外,金额单位为人民币元) - 61 - 六 分部信息(续) (2) 2023 年 1 至 6 月及 2023 年 6 月 30 日分部信息列示如下: 租赁及物业 管理业务 酒店经营 未分配的金额 分部间的 抵销 合计 对外交易收入 1,666,402,278 271,961,862 - - 1,938,364,140 分部间交易收入 1,842,491 3,411,616 - (5,254,107) - 主营业务成本 (518,526,122) (251,494,757) - - (770,020,879) 利息收入 16,829,724 244,299 - - 17,074,023 利息费用 - - (40,499,642) - (40,499,642) 对联营企业的投资 收益 - - 1,767,975 - 1,767,975 使用权资产折旧费 (1,801,654) - - - (1,801,654) 折旧费和摊销费 (183,871,459) (52,734,814) - - (236,606,273) 利润/(亏损)总额 904,979,046 3,649,748 (38,731,667) - 869,897,127 所得税费用 - - (215,874,488) - (215,874,488) 净利润 904,979,046 3,649,748 (254,606,155) - 654,022,639 资产总额 10,680,671,485 1,962,000,318 94,542,239 - 12,737,214,042 负债总额 (1,267,927,182) (102,496,997) (2,268,886,633) - (3,639,310,812) 对联营企业的长期 股权投资 - - 27,428,308 - 27,428,308 非流动资产增加额(i) 23,366,594 1,525,911 - - 24,892,505 (i) 非流动资产不包括长期股权投资和递延所得税资产。 七 关联方关系及其交易 1 
母公司情况 (1) 母公司基本情况 注册地 业务性质 国贸有限公司 北京 服务业 本公司的最终控制方为国贸有限公司。中国国际贸易中心股份有限公司 财务报表附注 截至 2024 年 6 月 30 日 6 个月期间 (除特别注明外,金额单位为人民币元) - 62 - 七 关联方关系及其交易(续) 1 母公司情况(续) (2) 母公司注册资本及其变化 2023 年 12 月 31 日 本期 增加 本期 减少 2024 年 6 月 30 日 国贸有限公司 240,000,000 美元 - - 240,000,000 美元 (3) 母公司对本公司的持股比例和表决权比例 2024 年 6 月 30 日 2023 年 12 月 31 日 持股比例 表决权比例 持股比例 表决权比例 国贸有限公司 80.65% 80.65% 80.65% 80.65% 2 子公司情况 子公司的基本情况及相关信息见附注五.1。 3 联营企业情况 联营企业的基本情况及相关信息见附注五.2。 4 其他关联方情况 与本集团的关系 中国世贸投资有限公司 母公司的中方投资者 嘉里兴业有限公司 母公司的外方投资者 香格里拉国际饭店管理有限公司(以下称“香 格里拉”) 母公司的外方投资者之关联公司 香格里拉饭店管理(上海)有限公司北京分公司 (以下称“香格里拉北京”) 母公司的外方投资者之关联公司中国国际贸易中心股份有限公司 财务报表附注 截至 2024 年 6 月 30 日 6 个月期间 (除特别注明外,金额单位为人民币元) - 63 - 七 关联方关系及其交易(续) 5 关联交易 (1) 购销商品、提供和接受劳务 本公司与关联方的交易价格以市场价为基础,由双方协商确定,经本公司董事会或股东大会批准后签订相关合同。本公司董 事会认为与关联方的交易均符合正常的商业条款。 接受劳务和服务 关联方 关联交易类型 关联交易内容 2024 年 1 至 6 月 2023 年 1 至 6 月 金额 金额 国贸有限公司 接受劳务 支付餐饮费及会员费 1,199,036 906,827 国贸有限公司 接受劳务 支付劳务费 3,222,808 3,317,866 国贸有限公司 接受劳务 支付公关管理及工会行政费等 4,001,239 2,751,966 国贸有限公司 接受劳务 支付酒店营运物资、食品加工 费、洗衣费等 2,518,468 1,854,990 </Content> </Quote> Human summary the document AI Please provide me with the document you would like me to summarize. I need the actual text of the document in order to analyze it and create a summary for you. For example, you can paste the text directly into our chat or provide a link to the document if it's publicly accessible online.

@rick-github commented on GitHub (Sep 22, 2024):

```
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2397062001/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 38200"
```

You are running 4 serving threads (`--parallel 4`) and the total context size is 8k (`--ctx-size 8192`), so each request is using the default context window of 2048 tokens. Whatever you are doing with `maxContext` and `maxResponse` is not relevant to ollama. The corresponding ollama configuration elements are [`num_ctx` and `num_predict`](https://github.com/ollama/ollama/blob/ad935f45ac19a8ba090db32580f3a6469e9858bb/docs/api.md#request-7); those are the parameters you need to adjust to get the documents to fit in the context window.
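As a quick sanity check, the per-request share can be computed directly from the two numbers in the log line — a minimal sketch, where `ctx_size` and `parallel` are just the logged values:

```shell
# The total context (--ctx-size) is divided evenly among the parallel
# serving slots, so each request only gets a fraction of it.
ctx_size=8192    # --ctx-size from the log
parallel=4       # --parallel from the log
echo $((ctx_size / parallel))    # per-request context window: 2048
```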

<!-- gh-comment-id:2365490805 -->

@goactiongo commented on GitHub (Sep 22, 2024):

> ```
> 9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2397062001/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 38200"
> ```
>
> You are running 4 serving threads (`--parallel 4`) and the total context size is 8k (`--ctx-size 8192`), so each request is using the default context window of 2048 tokens. Whatever you are doing with `maxContext` and `maxResponse` is not relevant to ollama. The corresponding ollama configuration elements are `num_ctx` and `num_predict`; those are the parameters you need to adjust to get the documents to fit in the context window.

Thanks for your reply.

![image](https://github.com/user-attachments/assets/b5ff6c4a-1beb-4341-b828-eefb06ed3605)

I want to know why `num_ctx` is not the same as my setting, and why `num_predict` is not shown.

## 1st testing "num_ctx": 120000, "num_predict": 7000

```sh
curl http://localhost:11434/api/generate -d '{
  "model": "glm4:9b",
  "prompt": "test hi",
  "stream": false,
  "options": {
    "num_ctx": 120000,
    "num_predict": 7000
  }
}'
```

ollama log shows `--ctx-size 120000`, without `num_predict`:

```
9月 22 17:55:19 gpu ollama[57349]: time=2024-09-22T17:55:19.237+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2118317042/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 --ctx-size 120000 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --port 43640"
```

## 2nd testing "num_ctx": 5000, "num_predict": 3000

```sh
curl http://localhost:11434/api/generate -d '{
  "model": "glm4:9b",
  "prompt": "test hi",
  "stream": false,
  "options": {
    "num_ctx": 5000,
    "num_predict": 3000
  }
}'
```

ollama log shows `--ctx-size 20000`, without `num_predict`:

```
9月 22 17:59:20 gpu ollama[57349]: time=2024-09-22T17:59:20.566+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2118317042/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 --ctx-size 20000 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 4 --port 44778"
```

<!-- gh-comment-id:2366513214 -->

@rick-github commented on GitHub (Sep 22, 2024):

The screenshot that shows a context length of 131072 is the context length that the model was trained with. This is different from `num_ctx`, which is the size of the context window that ollama allocates for processing queries. The context window consumes VRAM, and if it is very large, it can cause the model weights to overflow to system RAM, making inference much slower. For this reason, the default context window that ollama allocates is smaller than the context window that the model was trained with. By default, it is 2048 tokens.

1st test: the llama runner is running one thread (`--parallel 1`), so the total space for context is 120000 (`--ctx-size 120000`). `num_predict` is not shown as a parameter on the command line because it is not a per-model parameter, it is a per-query parameter. `num_predict` is passed to the llama runner as part of the request that includes the prompt.

2nd test: the llama runner is running four threads (`--parallel 4`), each with a context window of 5000, so the total space for context is 20000 (`--ctx-size 20000`).

The thread count changes because in the 1st test, ollama saw that you were asking for a very large context window, and, as explained earlier, a large context can cause model weights to spill to RAM, so ollama decided to use 1 thread. In the 2nd test, you asked for a smaller context, and ollama saw that it could fit 4 threads' worth of context in the available VRAM, so it set `--parallel 4`.

You can override this behavior, where ollama chooses the thread count, by setting `OLLAMA_NUM_PARALLEL` in the server environment. If you set `OLLAMA_NUM_PARALLEL=1` in the second test, the context size will be 5000 (`--ctx-size 5000`).
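Putting the two tests together, the `--ctx-size` that appears in the logs is simply `num_ctx` multiplied by the parallel slot count ollama chose — a sketch using the numbers from the tests above:

```shell
# ollama passes num_ctx * parallel to the llama runner as --ctx-size.

# 1st test: num_ctx=120000, ollama picked --parallel 1
echo $((120000 * 1))    # --ctx-size 120000

# 2nd test: num_ctx=5000, ollama picked --parallel 4
echo $((5000 * 4))      # --ctx-size 20000
```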

<!-- gh-comment-id:2366647968 -->

@goactiongo commented on GitHub (Sep 22, 2024):

thanks

<!-- gh-comment-id:2366726045 -->

@goactiongo commented on GitHub (Sep 23, 2024):

Is there something wrong with my code?
Neither "num_parallel": 2 nor "ollama_num_paralle": 2

```json
{
  "model": "glm4:9b",
  "prompt": "{{qst}}:{{text}}",
  "stream": false,
  "options": {
    "num_ctx": 5000,
    "num_predict": 3000,
    "num_parallel": 2
  }
}
```

```
13:19:38 gpu ollama[38429]: time=2024-09-23T13:19:38.388+08:00 level=WARN source=types.go:509 msg="invalid option provided" option=num_parallel
9月 23 13:22:59 gpu ollama[38429]: time=2024-09-23T13:22:59.115+08:00 level=WARN source=types.go:509 msg="invalid option provided" option=ollama_num_parallel
```

<!-- gh-comment-id:2367252036 -->

@rick-github commented on GitHub (Sep 23, 2024):

`num_parallel` is not a valid option in an API call. You need to set `OLLAMA_NUM_PARALLEL=2` in the [server environment](https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux).
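For a systemd install like the one in the logs, that would look roughly as follows — a sketch following the linked FAQ; the unit name `ollama.service` assumes the standard Linux install:

```shell
# Open a drop-in override file for the ollama service:
sudo systemctl edit ollama.service

# In the editor, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=2"

# Reload units and restart the server so the variable takes effect:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```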

<!-- gh-comment-id:2367261976 -->

@goactiongo commented on GitHub (Sep 23, 2024):

thanks for your help

<!-- gh-comment-id:2367281513 -->
Reference: github-starred/ollama#30126