[GH-ISSUE #6093] Only one of the dual CPUs is in use #50321

Closed
opened 2026-04-28 15:06:29 -05:00 by GiteaMirror · 8 comments

Originally created by @Mipuqt on GitHub (Jul 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6093

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

My machine has two CPUs and no GPU. When I run a model, overall CPU usage never exceeds about 50%.
![PixPin_2024-07-31_16-11-31](https://github.com/user-attachments/assets/36c049f8-e4d5-4be8-a6f2-420532362226)
![image](https://github.com/user-attachments/assets/028ababf-4824-42fd-b9d5-a6e88b3eb9c6)

OS

Linux

GPU

Other

CPU

Intel

Ollama version

0.3.0

GiteaMirror added the bug, linux labels 2026-04-28 15:06:29 -05:00

@mxmp210 commented on GitHub (Jul 31, 2024):

It's because of SMT: inference sees no performance gain from the extra hyperthreads, since the CPU has a limited number of floating-point pipelines, which is half the logical CPU count.

To make sure the OS is reporting correct CPU identifiers, check with `grep -E 'processor|core id' /proc/cpuinfo`. Logical processors that share the same core id only count once, since they reside on the same pipeline and are treated as a single compute thread.

If you believe SMT is disabled, you can explicitly set the number of threads by passing a thread count in the API.

Currently there isn't an environment variable exposed for this, though that may change in the future.
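For reference, a thread count can be passed per request like this (a minimal sketch; the model name is just a placeholder for whatever is pulled locally):

```sh
# Request inference with an explicit thread count via the REST API.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2:7b",
  "prompt": "Why is the sky blue?",
  "options": { "num_thread": 40 }
}'
```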


@rick-github commented on GitHub (Jul 31, 2024):

You can add `"options": { "num_thread": x }` to the API call to change the number of threads used for inference, but there are diminishing returns as the thread count increases. On my 16-core, 24-processor machine the returns flatten out at about 10 threads.

```
     +---------------------------------------------------------------------+
  14 |-+      +        +       +        +        +        +       +      +-|
     |                                              threads vs tps ******* |
     |                                                                     |
  12 |-+                                                                 +-|
     |                                         ****************************|
     |                                *********                            |
  10 |-+                          ****                                   +-|
     |                    ********                                         |
   8 |-+                **                                               +-|
     |                **                                                   |
     |              **                                                     |
   6 |-+         ***                                                     +-|
     |         **                                                          |
     |        *                                                            |
   4 |-+    **                                                           +-|
     |    **                                                               |
   2 |-+ *                                                               +-|
     |                                                                     |
     |        +        +       +        +        +        +       +        |
   0 +---------------------------------------------------------------------+
     0        2        4       6        8        10       12      14       16
```
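A sweep like the one plotted above can be reproduced with a short script along these lines (a sketch: it assumes a local server, a pulled model, and `jq` installed):

```sh
#!/bin/sh
# Sweep num_thread and report tokens/sec. The API reports eval_duration
# in nanoseconds, so tps = eval_count / (eval_duration / 1e9).
# The model name is a placeholder.
for n in 2 4 6 8 10 12 14 16; do
  tps=$(curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"qwen2:7b\",
    \"prompt\": \"Why is the sky blue?\",
    \"stream\": false,
    \"options\": { \"num_thread\": $n }
  }" | jq '.eval_count / (.eval_duration / 1e9)')
  echo "threads=$n tps=$tps"
done
```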


@dhiltgen commented on GitHub (Aug 1, 2024):

This sounds like it's probably a variation of #2496

@Mipuqt can you share the output of the following commands on your system?

```
ls /sys/devices/system/cpu/
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
```
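A quicker summary of the same topology (sockets, cores per socket, threads per core) is also available via `lscpu`, e.g.:

```sh
# Summarize the CPU topology; the grep pattern just trims lscpu's output.
lscpu | grep -Ei 'socket|core|thread|numa'
```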

@Mipuqt commented on GitHub (Aug 2, 2024):

> This sounds like it's probably a variation of #2496
>
> @Mipuqt can you share the output of the following commands on your system?
>
> ```
> ls /sys/devices/system/cpu/
> cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
> ```
```
root@ubuntu:~# ls /sys/devices/system/cpu/
cpu0   cpu14  cpu2   cpu25  cpu30  cpu36  cpu41  cpu47  cpu52  cpu58  cpu63  cpu69  cpu74  cpu8          isolated    possible
cpu1   cpu15  cpu20  cpu26  cpu31  cpu37  cpu42  cpu48  cpu53  cpu59  cpu64  cpu7   cpu75  cpu9          kernel_max  power
cpu10  cpu16  cpu21  cpu27  cpu32  cpu38  cpu43  cpu49  cpu54  cpu6   cpu65  cpu70  cpu76  cpufreq       microcode   present
cpu11  cpu17  cpu22  cpu28  cpu33  cpu39  cpu44  cpu5   cpu55  cpu60  cpu66  cpu71  cpu77  cpuidle       modalias    smt
cpu12  cpu18  cpu23  cpu29  cpu34  cpu4   cpu45  cpu50  cpu56  cpu61  cpu67  cpu72  cpu78  hotplug       offline     uevent
cpu13  cpu19  cpu24  cpu3   cpu35  cpu40  cpu46  cpu51  cpu57  cpu62  cpu68  cpu73  cpu79  intel_pstate  online      vulnerabilities
root@ubuntu:~# cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
0000,00000100,00000001
0000,00040000,00000400
0000,00080000,00000800
0000,00100000,00001000
0000,00200000,00002000
0000,00400000,00004000
0000,00800000,00008000
0000,01000000,00010000
0000,02000000,00020000
0000,04000000,00040000
0000,08000000,00080000
0000,00000200,00000002
0000,10000000,00100000
0000,20000000,00200000
0000,40000000,00400000
0000,80000000,00800000
0001,00000000,01000000
0002,00000000,02000000
0004,00000000,04000000
0008,00000000,08000000
0010,00000000,10000000
0020,00000000,20000000
0000,00000400,00000004
0040,00000000,40000000
0080,00000000,80000000
0100,00000001,00000000
0200,00000002,00000000
0400,00000004,00000000
0800,00000008,00000000
1000,00000010,00000000
2000,00000020,00000000
4000,00000040,00000000
8000,00000080,00000000
0000,00000800,00000008
0000,00000100,00000001
0000,00000200,00000002
0000,00000400,00000004
0000,00000800,00000008
0000,00001000,00000010
0000,00002000,00000020
0000,00004000,00000040
0000,00008000,00000080
0000,00010000,00000100
0000,00020000,00000200
0000,00001000,00000010
0000,00040000,00000400
0000,00080000,00000800
0000,00100000,00001000
0000,00200000,00002000
0000,00400000,00004000
0000,00800000,00008000
0000,01000000,00010000
0000,02000000,00020000
0000,04000000,00040000
0000,08000000,00080000
0000,00002000,00000020
0000,10000000,00100000
0000,20000000,00200000
0000,40000000,00400000
0000,80000000,00800000
0001,00000000,01000000
0002,00000000,02000000
0004,00000000,04000000
0008,00000000,08000000
0010,00000000,10000000
0020,00000000,20000000
0000,00004000,00000040
0040,00000000,40000000
0080,00000000,80000000
0100,00000001,00000000
0200,00000002,00000000
0400,00000004,00000000
0800,00000008,00000000
1000,00000010,00000000
2000,00000020,00000000
4000,00000040,00000000
8000,00000080,00000000
0000,00008000,00000080
0000,00010000,00000100
0000,00020000,00000200
```

@dhiltgen commented on GitHub (Aug 2, 2024):

40 cores, 80 hyperthreads.

Can you share the server log loading a model so I can see how many threads we allocated for inference?
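For reference, the 40-core figure follows from counting the unique sibling masks in that output:

```sh
# Each hyperthread pair shares one mask, so unique masks = physical cores.
sort -u /sys/devices/system/cpu/cpu*/topology/thread_siblings | wc -l
```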


@Mipuqt commented on GitHub (Aug 2, 2024):

Sorry, I can only access the logs on weekdays. The CPU is an Intel® Xeon® Gold 6138 (20 cores, 40 threads), but I couldn't find how many ALUs it has. I tried adding `"options": { "num_thread": x }` to test; I could see the threads being used, so the setting takes effect. The test model is FlagAlpha/Llama2-Chinese-7b-Chat-LoRA (fp16), converted from safetensors to Ollama via llama.cpp. No matter how I tested, using 20~30 threads was the most efficient. Maybe there's something wrong with the code. avg_time_per_token = eval_duration / eval_count
![image](https://github.com/user-attachments/assets/3c63d592-b565-412b-99ec-605f6c64186b)
![image](https://github.com/user-attachments/assets/3ba27c5a-13e4-4694-a4ed-974d1b3eccc3)
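As a worked example of that formula (illustrative numbers): an eval_duration of 120 s over an eval_count of 300 tokens gives 0.4 s/token, i.e. 2.5 tokens/s. Note the API reports eval_duration in nanoseconds, so divide by 1e9 first.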


@Mipuqt commented on GitHub (Aug 5, 2024):

> 40 cores, 80 hyperthreads.
>
> Can you share the server log loading a model so I can see how many threads we allocated for inference?

```
Aug 05 15:04:19 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:04:19 | 200 |          6m2s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.491+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[236.7 GiB]" memory.required.full="4.9 GiB" memory.required.partial="0 B" memory.required.kv="448.0 MiB" memory.required.allocations="[4.9 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.4 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.492+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3509060138/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-43f7a214e5329f672bb05404cfba1913cbb70fdaa1a17497224e1925046b0ed5 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 34971"
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.492+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.492+08:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.492+08:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
Aug 05 15:04:19 ubuntu ollama[448639]: INFO [main] build info | build=1 commit="d94c6e0" tid="140398815795072" timestamp=1722841459
Aug 05 15:04:19 ubuntu ollama[448639]: INFO [main] system info | n_threads=40 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140398815795072" timestamp=1722841459 total_threads=80
Aug 05 15:04:19 ubuntu ollama[448639]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="79" port="34971" tid="140398815795072" timestamp=1722841459
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: loaded meta data with 21 key-value pairs and 339 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-43f7a214e5329f672bb05404cfba1913cbb70fdaa1a17497224e1925046b0ed5 (version GGUF V3 (latest))
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   1:                               general.name str              = Qwen2-7B-Instruct
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  20:               general.quantization_version u32              = 2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - type  f32:  141 tensors
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - type q4_0:  197 tensors
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - type q6_K:    1 tensors
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.744+08:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
Aug 05 15:04:19 ubuntu ollama[1919922]: llm_load_vocab: special tokens cache size = 421
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_vocab: token to piece cache size = 0.9352 MB
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: format           = GGUF V3 (latest)
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: arch             = qwen2
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: vocab type       = BPE
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_vocab          = 152064
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_merges         = 151387
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: vocab_only       = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_ctx_train      = 32768
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd           = 3584
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_layer          = 28
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_head           = 28
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_head_kv        = 4
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_rot            = 128
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_swa            = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd_head_k    = 128
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd_head_v    = 128
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_gqa            = 7
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd_k_gqa     = 512
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd_v_gqa     = 512
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_ff             = 18944
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_expert         = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_expert_used    = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: causal attn      = 1
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: pooling type     = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: rope type        = 2
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: rope scaling     = linear
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: freq_base_train  = 1000000.0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: freq_scale_train = 1
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_ctx_orig_yarn  = 32768
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: rope_finetuned   = unknown
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: ssm_d_conv       = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: ssm_d_inner      = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: ssm_d_state      = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: ssm_dt_rank      = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: model type       = ?B
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: model ftype      = Q4_0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: model params     = 7.62 B
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: model size       = 4.12 GiB (4.65 BPW)
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: general.name     = Qwen2-7B-Instruct
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: LF token         = 148848 'ÄĬ'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: max token length = 256
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_tensors: ggml ctx size =    0.15 MiB
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_tensors:        CPU buffer size =  4220.43 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: n_ctx      = 8192
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: n_batch    = 512
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: n_ubatch   = 512
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: flash_attn = 0
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: freq_base  = 1000000.0
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: freq_scale = 1
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_kv_cache_init:        CPU KV buffer size =   448.00 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model:        CPU  output buffer size =     2.38 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model:        CPU compute buffer size =   492.01 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: graph nodes  = 986
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: graph splits = 1
Aug 05 15:04:22 ubuntu ollama[448639]: INFO [main] model loaded | tid="140398815795072" timestamp=1722841462
Aug 05 15:04:23 ubuntu ollama[1919922]: time=2024-08-05T15:04:23.011+08:00 level=INFO source=server.go:622 msg="llama runner started in 3.52 seconds"
Aug 05 15:08:21 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:08:21 | 200 |         4m34s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:09:20 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:09:20 | 200 |         5m57s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:10:30 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:10:30 | 200 |         11m3s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:11:38 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:11:38 | 200 |         7m36s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:12:40 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:12:40 | 200 |         8m33s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:13:53 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:13:53 | 200 |         9m34s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:14:26 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:14:26 | 200 |          5m5s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:14:35 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:14:35 | 200 |         6m13s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:14:52 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:14:52 | 200 |         4m21s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:15:30 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:15:30 | 200 |         3m51s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:16:54 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:16:54 | 200 |         4m13s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:18:15 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:18:15 | 200 |         4m21s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:18:25 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:18:25 | 200 |         3m49s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:18:30 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:18:30 | 200 |          4m4s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:18:44 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:18:44 | 200 |         3m51s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:34:36 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:34:36 | 200 |    1.913155ms |      172.17.0.4 | GET      "/api/tags"
Aug 05 15:34:36 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:34:36 | 200 |      51.917µs |      172.17.0.4 | GET      "/api/version"
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.925+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[236.7 GiB]" memory.required.full="4.9 GiB" memory.required.partial="0 B" memory.required.kv="448.0 MiB" memory.required.allocations="[4.9 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.4 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.926+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3509060138/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-43f7a214e5329f672bb05404cfba1913cbb70fdaa1a17497224e1925046b0ed5 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 45271"
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.926+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.926+08:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.927+08:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
Aug 05 15:36:29 ubuntu ollama[527197]: INFO [main] build info | build=1 commit="d94c6e0" tid="140582874306432" timestamp=1722843389
Aug 05 15:36:29 ubuntu ollama[527197]: INFO [main] system info | n_threads=40 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140582874306432" timestamp=1722843389 total_threads=80
Aug 05 15:36:29 ubuntu ollama[527197]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="79" port="45271" tid="140582874306432" timestamp=1722843389
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: loaded meta data with 21 key-value pairs and 339 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-43f7a214e5329f672bb05404cfba1913cbb70fdaa1a17497224e1925046b0ed5 (version GGUF V3 (latest))
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   1:                               general.name str              = Qwen2-7B-Instruct
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  20:               general.quantization_version u32
```

@dhiltgen commented on GitHub (Aug 5, 2024):

Thanks for sharing that log. I was incorrect: we are correctly allocating 40 threads for 40 physical cores, so this isn't a thread-count problem. It's a NUMA setup problem, and I think we're likely making it worse by stuffing all the threads into one socket.
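One way to test that hypothesis is to pin the server to a single socket and compare throughput (a sketch; the node IDs assume a standard two-socket layout, so check `numactl --hardware` first):

```sh
# Run the server confined to NUMA node 0's CPUs and memory, then repeat
# the same prompt and compare tokens/sec against the unpinned run.
numactl --cpunodebind=0 --membind=0 ollama serve
```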

Reference: github-starred/ollama#50321