[GH-ISSUE #6093] Only one of the dual CPUs is in use #50321

Closed
opened 2026-04-28 15:06:29 -05:00 by GiteaMirror · 8 comments

Originally created by @Mipuqt on GitHub (Jul 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6093

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

My machine has two CPUs and no GPU. When I run a model, overall CPU usage never exceeds about 50%.
![PixPin_2024-07-31_16-11-31](https://github.com/user-attachments/assets/36c049f8-e4d5-4be8-a6f2-420532362226)
![image](https://github.com/user-attachments/assets/028ababf-4824-42fd-b9d5-a6e88b3eb9c6)

OS

Linux

GPU

Other

CPU

Intel

Ollama version

0.3.0

GiteaMirror added the bug, linux labels 2026-04-28 15:06:29 -05:00

@mxmp210 commented on GitHub (Jul 31, 2024):

It's because of SMT: inference sees no performance gain from the extra hyperthreads, since the CPU has a limited number of floating-point pipelines, which is half the logical CPU count.

To make sure the OS is reporting correct CPU identifiers, check with `grep -E 'processor|core id' /proc/cpuinfo`. Logical processors that share the same core id only count once, since they reside on the same pipeline and are treated as a single compute thread.

If you believe SMT is disabled, you can explicitly set the number of threads by passing a thread count in the API.

Currently there isn't an environment variable exposed for this, though that may change in the future.
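For reference, a thread count can be passed per request like this (a minimal sketch; the model name is just a placeholder for whatever is pulled locally):

```sh
# Request inference with an explicit thread count via the REST API.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2:7b",
  "prompt": "Why is the sky blue?",
  "options": { "num_thread": 40 }
}'
```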


@rick-github commented on GitHub (Jul 31, 2024):

You can add `"options": { "num_thread": x }` to the API call to change the number of threads used for inference, but there are diminishing returns as the thread count increases. On my 16-core, 24-processor machine the returns flatten out at about 10 threads.

```
     +---------------------------------------------------------------------+
  14 |-+      +        +       +        +        +        +       +      +-|
     |                                              threads vs tps ******* |
     |                                                                     |
  12 |-+                                                                 +-|
     |                                         ****************************|
     |                                *********                            |
  10 |-+                          ****                                   +-|
     |                    ********                                         |
   8 |-+                **                                               +-|
     |                **                                                   |
     |              **                                                     |
   6 |-+         ***                                                     +-|
     |         **                                                          |
     |        *                                                            |
   4 |-+    **                                                           +-|
     |    **                                                               |
   2 |-+ *                                                               +-|
     |                                                                     |
     |        +        +       +        +        +        +       +        |
   0 +---------------------------------------------------------------------+
     0        2        4       6        8        10       12      14       16
```
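A sweep like the one plotted above can be reproduced with a short script along these lines (a sketch: it assumes a local server, a pulled model, and `jq` installed):

```sh
#!/bin/sh
# Sweep num_thread and report tokens/sec. The API reports eval_duration
# in nanoseconds, so tps = eval_count / (eval_duration / 1e9).
# The model name is a placeholder.
for n in 2 4 6 8 10 12 14 16; do
  tps=$(curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"qwen2:7b\",
    \"prompt\": \"Why is the sky blue?\",
    \"stream\": false,
    \"options\": { \"num_thread\": $n }
  }" | jq '.eval_count / (.eval_duration / 1e9)')
  echo "threads=$n tps=$tps"
done
```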


@dhiltgen commented on GitHub (Aug 1, 2024):

This sounds like it's probably a variation of #2496

@Mipuqt can you share the output of the following commands on your system?

```
ls /sys/devices/system/cpu/
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
```
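A quicker summary of the same topology (sockets, cores per socket, threads per core) is also available via `lscpu`, e.g.:

```sh
# Summarize the CPU topology; the grep pattern just trims lscpu's output.
lscpu | grep -Ei 'socket|core|thread|numa'
```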

@Mipuqt commented on GitHub (Aug 2, 2024):

> This sounds like it's probably a variation of #2496
>
> @Mipuqt can you share the output of the following commands on your system?
>
> ```
> ls /sys/devices/system/cpu/
> cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
> ```
```
root@ubuntu:~# ls /sys/devices/system/cpu/
cpu0   cpu14  cpu2   cpu25  cpu30  cpu36  cpu41  cpu47  cpu52  cpu58  cpu63  cpu69  cpu74  cpu8          isolated    possible
cpu1   cpu15  cpu20  cpu26  cpu31  cpu37  cpu42  cpu48  cpu53  cpu59  cpu64  cpu7   cpu75  cpu9          kernel_max  power
cpu10  cpu16  cpu21  cpu27  cpu32  cpu38  cpu43  cpu49  cpu54  cpu6   cpu65  cpu70  cpu76  cpufreq       microcode   present
cpu11  cpu17  cpu22  cpu28  cpu33  cpu39  cpu44  cpu5   cpu55  cpu60  cpu66  cpu71  cpu77  cpuidle       modalias    smt
cpu12  cpu18  cpu23  cpu29  cpu34  cpu4   cpu45  cpu50  cpu56  cpu61  cpu67  cpu72  cpu78  hotplug       offline     uevent
cpu13  cpu19  cpu24  cpu3   cpu35  cpu40  cpu46  cpu51  cpu57  cpu62  cpu68  cpu73  cpu79  intel_pstate  online      vulnerabilities
root@ubuntu:~# cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
0000,00000100,00000001
0000,00040000,00000400
0000,00080000,00000800
0000,00100000,00001000
0000,00200000,00002000
0000,00400000,00004000
0000,00800000,00008000
0000,01000000,00010000
0000,02000000,00020000
0000,04000000,00040000
0000,08000000,00080000
0000,00000200,00000002
0000,10000000,00100000
0000,20000000,00200000
0000,40000000,00400000
0000,80000000,00800000
0001,00000000,01000000
0002,00000000,02000000
0004,00000000,04000000
0008,00000000,08000000
0010,00000000,10000000
0020,00000000,20000000
0000,00000400,00000004
0040,00000000,40000000
0080,00000000,80000000
0100,00000001,00000000
0200,00000002,00000000
0400,00000004,00000000
0800,00000008,00000000
1000,00000010,00000000
2000,00000020,00000000
4000,00000040,00000000
8000,00000080,00000000
0000,00000800,00000008
0000,00000100,00000001
0000,00000200,00000002
0000,00000400,00000004
0000,00000800,00000008
0000,00001000,00000010
0000,00002000,00000020
0000,00004000,00000040
0000,00008000,00000080
0000,00010000,00000100
0000,00020000,00000200
0000,00001000,00000010
0000,00040000,00000400
0000,00080000,00000800
0000,00100000,00001000
0000,00200000,00002000
0000,00400000,00004000
0000,00800000,00008000
0000,01000000,00010000
0000,02000000,00020000
0000,04000000,00040000
0000,08000000,00080000
0000,00002000,00000020
0000,10000000,00100000
0000,20000000,00200000
0000,40000000,00400000
0000,80000000,00800000
0001,00000000,01000000
0002,00000000,02000000
0004,00000000,04000000
0008,00000000,08000000
0010,00000000,10000000
0020,00000000,20000000
0000,00004000,00000040
0040,00000000,40000000
0080,00000000,80000000
0100,00000001,00000000
0200,00000002,00000000
0400,00000004,00000000
0800,00000008,00000000
1000,00000010,00000000
2000,00000020,00000000
4000,00000040,00000000
8000,00000080,00000000
0000,00008000,00000080
0000,00010000,00000100
0000,00020000,00000200
```

@dhiltgen commented on GitHub (Aug 2, 2024):

40 cores, 80 hyperthreads.

Can you share the server log loading a model so I can see how many threads we allocated for inference?
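For reference, the 40-core figure follows from counting the unique sibling masks in that output:

```sh
# Each hyperthread pair shares one mask, so unique masks = physical cores.
sort -u /sys/devices/system/cpu/cpu*/topology/thread_siblings | wc -l
```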


@Mipuqt commented on GitHub (Aug 2, 2024):

Sorry, I can only access the logs on weekdays. The CPU is an Intel® Xeon® Gold 6138 (20 cores, 40 threads), but I couldn't find how many ALUs it has. I tried adding `"options": { "num_thread": x }` to test; I could see the threads being used, so the setting takes effect. The test model is FlagAlpha/Llama2-Chinese-7b-Chat-LoRA (fp16), converted from safetensors to Ollama via llama.cpp. No matter how I tested, using 20~30 threads was the most efficient. Maybe there's something wrong with the code. avg_time_per_token = eval_duration / eval_count
![image](https://github.com/user-attachments/assets/3c63d592-b565-412b-99ec-605f6c64186b)
![image](https://github.com/user-attachments/assets/3ba27c5a-13e4-4694-a4ed-974d1b3eccc3)
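As a worked example of that formula (illustrative numbers): an eval_duration of 120 s over an eval_count of 300 tokens gives 0.4 s/token, i.e. 2.5 tokens/s. Note the API reports eval_duration in nanoseconds, so divide by 1e9 first.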


@Mipuqt commented on GitHub (Aug 5, 2024):

> 40 cores, 80 hyperthreads.
>
> Can you share the server log loading a model so I can see how many threads we allocated for inference?

```
Aug 05 15:04:19 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:04:19 | 200 |          6m2s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.491+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[236.7 GiB]" memory.required.full="4.9 GiB" memory.required.partial="0 B" memory.required.kv="448.0 MiB" memory.required.allocations="[4.9 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.4 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.492+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3509060138/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-43f7a214e5329f672bb05404cfba1913cbb70fdaa1a17497224e1925046b0ed5 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 34971"
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.492+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.492+08:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.492+08:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
Aug 05 15:04:19 ubuntu ollama[448639]: INFO [main] build info | build=1 commit="d94c6e0" tid="140398815795072" timestamp=1722841459
Aug 05 15:04:19 ubuntu ollama[448639]: INFO [main] system info | n_threads=40 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140398815795072" timestamp=1722841459 total_threads=80
Aug 05 15:04:19 ubuntu ollama[448639]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="79" port="34971" tid="140398815795072" timestamp=1722841459
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: loaded meta data with 21 key-value pairs and 339 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-43f7a214e5329f672bb05404cfba1913cbb70fdaa1a17497224e1925046b0ed5 (version GGUF V3 (latest))
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   1:                               general.name str              = Qwen2-7B-Instruct
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - kv  20:               general.quantization_version u32              = 2
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - type  f32:  141 tensors
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - type q4_0:  197 tensors
Aug 05 15:04:19 ubuntu ollama[1919922]: llama_model_loader: - type q6_K:    1 tensors
Aug 05 15:04:19 ubuntu ollama[1919922]: time=2024-08-05T15:04:19.744+08:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
Aug 05 15:04:19 ubuntu ollama[1919922]: llm_load_vocab: special tokens cache size = 421
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_vocab: token to piece cache size = 0.9352 MB
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: format           = GGUF V3 (latest)
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: arch             = qwen2
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: vocab type       = BPE
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_vocab          = 152064
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_merges         = 151387
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: vocab_only       = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_ctx_train      = 32768
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd           = 3584
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_layer          = 28
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_head           = 28
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_head_kv        = 4
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_rot            = 128
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_swa            = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd_head_k    = 128
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd_head_v    = 128
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_gqa            = 7
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd_k_gqa     = 512
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_embd_v_gqa     = 512
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_ff             = 18944
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_expert         = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_expert_used    = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: causal attn      = 1
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: pooling type     = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: rope type        = 2
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: rope scaling     = linear
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: freq_base_train  = 1000000.0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: freq_scale_train = 1
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: n_ctx_orig_yarn  = 32768
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: rope_finetuned   = unknown
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: ssm_d_conv       = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: ssm_d_inner      = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: ssm_d_state      = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: ssm_dt_rank      = 0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: model type       = ?B
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: model ftype      = Q4_0
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: model params     = 7.62 B
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: model size       = 4.12 GiB (4.65 BPW)
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: general.name     = Qwen2-7B-Instruct
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: LF token         = 148848 'ÄĬ'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_print_meta: max token length = 256
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_tensors: ggml ctx size =    0.15 MiB
Aug 05 15:04:20 ubuntu ollama[1919922]: llm_load_tensors:        CPU buffer size =  4220.43 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: n_ctx      = 8192
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: n_batch    = 512
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: n_ubatch   = 512
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: flash_attn = 0
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: freq_base  = 1000000.0
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: freq_scale = 1
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_kv_cache_init:        CPU KV buffer size =   448.00 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model:        CPU  output buffer size =     2.38 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model:        CPU compute buffer size =   492.01 MiB
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: graph nodes  = 986
Aug 05 15:04:22 ubuntu ollama[1919922]: llama_new_context_with_model: graph splits = 1
Aug 05 15:04:22 ubuntu ollama[448639]: INFO [main] model loaded | tid="140398815795072" timestamp=1722841462
Aug 05 15:04:23 ubuntu ollama[1919922]: time=2024-08-05T15:04:23.011+08:00 level=INFO source=server.go:622 msg="llama runner started in 3.52 seconds"
Aug 05 15:08:21 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:08:21 | 200 |         4m34s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:09:20 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:09:20 | 200 |         5m57s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:10:30 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:10:30 | 200 |         11m3s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:11:38 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:11:38 | 200 |         7m36s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:12:40 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:12:40 | 200 |         8m33s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:13:53 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:13:53 | 200 |         9m34s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:14:26 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:14:26 | 200 |          5m5s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:14:35 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:14:35 | 200 |         6m13s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:14:52 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:14:52 | 200 |         4m21s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:15:30 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:15:30 | 200 |         3m51s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:16:54 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:16:54 | 200 |         4m13s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:18:15 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:18:15 | 200 |         4m21s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:18:25 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:18:25 | 200 |         3m49s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:18:30 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:18:30 | 200 |          4m4s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:18:44 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:18:44 | 200 |         3m51s |      172.17.0.4 | POST     "/api/chat"
Aug 05 15:34:36 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:34:36 | 200 |    1.913155ms |      172.17.0.4 | GET      "/api/tags"
Aug 05 15:34:36 ubuntu ollama[1919922]: [GIN] 2024/08/05 - 15:34:36 | 200 |      51.917µs |      172.17.0.4 | GET      "/api/version"
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.925+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[236.7 GiB]" memory.required.full="4.9 GiB" memory.required.partial="0 B" memory.required.kv="448.0 MiB" memory.required.allocations="[4.9 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.4 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.926+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3509060138/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-43f7a214e5329f672bb05404cfba1913cbb70fdaa1a17497224e1925046b0ed5 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 45271"
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.926+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.926+08:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
Aug 05 15:36:29 ubuntu ollama[1919922]: time=2024-08-05T15:36:29.927+08:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
Aug 05 15:36:29 ubuntu ollama[527197]: INFO [main] build info | build=1 commit="d94c6e0" tid="140582874306432" timestamp=1722843389
Aug 05 15:36:29 ubuntu ollama[527197]: INFO [main] system info | n_threads=40 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140582874306432" timestamp=1722843389 total_threads=80
Aug 05 15:36:29 ubuntu ollama[527197]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="79" port="45271" tid="140582874306432" timestamp=1722843389
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: loaded meta data with 21 key-value pairs and 339 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-43f7a214e5329f672bb05404cfba1913cbb70fdaa1a17497224e1925046b0ed5 (version GGUF V3 (latest))
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   1:                               general.name str              = Qwen2-7B-Instruct
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
Aug 05 15:36:29 ubuntu ollama[1919922]: llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
Aug 05 15:36:30 ubuntu ollama[1919922]: llama_model_loader: - kv  20:               general.quantization_version u32
```

@dhiltgen commented on GitHub (Aug 5, 2024):

Thanks for sharing that log. I was incorrect: we are correctly allocating 40 threads for 40 physical cores, so this isn't a thread-count problem. It's a NUMA setup problem, and I think we're likely making it worse by stuffing all the threads into one socket.
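One way to test that hypothesis is to pin the server to a single socket and compare throughput (a sketch; the node IDs assume a standard two-socket layout, so check `numactl --hardware` first):

```sh
# Run the server confined to NUMA node 0's CPUs and memory, then repeat
# the same prompt and compare tokens/sec against the unpinned run.
numactl --cpunodebind=0 --membind=0 ollama serve
```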

Reference: github-starred/ollama#50321