[GH-ISSUE #6930] Tesla p40 24G with quadro M6000 24G can not work together #50898

Closed
opened 2026-04-28 17:22:59 -05:00 by GiteaMirror · 11 comments
Owner

Originally created by @Blake110 on GitHub (Sep 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6930

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

With a P40 and an M6000 installed together, only the P40 works; the M6000's memory is not used by ollama, even after modifying ollama.service for multi-GPU use.
I tried the P40 with a 1080 Ti, which works fine with the default ollama.service. The P40 with an RTX 2060 also works fine with the default ollama.service.
Can anyone tell me why, and is there a chance to make them work together? Thx.
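For reference, the usual way to adjust ollama.service for multiple GPUs is a systemd drop-in. A minimal sketch (the file path and device indices are illustrative, not taken from this report):

```ini
# /etc/systemd/system/ollama.service.d/override.conf  (illustrative path)
[Service]
# Expose both GPUs to ollama; the 0,1 indices are illustrative.
Environment="CUDA_VISIBLE_DEVICES=0,1"
# Optionally ask the scheduler to spread a model across all GPUs.
Environment="OLLAMA_SCHED_SPREAD=1"
```

After editing, `systemctl daemon-reload && systemctl restart ollama` would pick up the override.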

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.11

GiteaMirror added the bug and nvidia labels 2026-04-28 17:22:59 -05:00

@rick-github commented on GitHub (Sep 24, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may help in debugging.


@Blake110 commented on GitHub (Sep 25, 2024):

> [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may help in debugging.

Thanks for your reply. Logs below.

Sep 25 01:54:13 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:13 | 200 | 43.216µs | 127.0.0.1 | HEAD "/"
Sep 25 01:54:13 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:13 | 200 | 164.080988ms | 127.0.0.1 | GET "/api/tags"
Sep 25 01:54:21 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:21 | 200 | 22.7µs | 127.0.0.1 | HEAD "/"
Sep 25 01:54:21 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:21 | 200 | 83.877µs | 127.0.0.1 | GET "/api/ps"
Sep 25 01:54:49 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:49 | 200 | 24.045µs | 127.0.0.1 | HEAD "/"
Sep 25 01:54:49 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:49 | 200 | 1.135969ms | 127.0.0.1 | GET "/api/tags"
Sep 25 01:55:03 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:55:03 | 200 | 22.533µs | 127.0.0.1 | HEAD "/"
Sep 25 01:55:03 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:55:03 | 200 | 186.942878ms | 127.0.0.1 | POST "/api/show"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.644-07:00 level=INFO source=server.go:103 msg="system memory" total="78.5 GiB" free="77.2 GiB" free_swap="8.0 GiB"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.645-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2509469182/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 37095"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.648-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] build info | build=10 commit="9225b05" tid="139687410753536" timestamp=1727254503
Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139687410753536" timestamp=1727254503 total_threads=14
Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="37095" tid="139687410753536" timestamp=1727254503
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 1: general.type str = model
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 5: general.size_label str = 70B
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 9: llama.block_count u32 = 80
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 17: general.file_type u32 = 2
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type f32: 162 tensors
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type q4_0: 561 tensors
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type q6_K: 1 tensors
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.899-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_vocab: special tokens cache size = 256
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: arch = llama
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: vocab type = BPE
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_vocab = 128256
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_merges = 280147
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: vocab_only = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ctx_train = 131072
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd = 8192
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_layer = 80
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_head = 64
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_head_kv = 8
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_rot = 128
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_swa = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_head_k = 128
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_head_v = 128
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_gqa = 8
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ff = 28672
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_expert = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_expert_used = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: causal attn = 1
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: pooling type = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope type = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope scaling = linear
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: freq_base_train = 500000.0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: freq_scale_train = 1
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope_finetuned = unknown
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_conv = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_inner = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_state = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_dt_rank = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model type = 70B
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model ftype = Q4_0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model params = 70.55 B
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: LF token = 128 'Ä'
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: max token length = 256
Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: found 1 CUDA devices:
Sep 25 01:55:04 ai-platform ollama[2049]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_tensors: ggml ctx size = 0.68 MiB
Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloading 47 repeating layers to GPU
Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloaded 47/81 layers to GPU
Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CPU buffer size = 38110.61 MiB
Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ctx = 2048
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_batch = 512
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ubatch = 512
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: flash_attn = 0
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_base = 500000.0
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_scale = 1
Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph nodes = 2566
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph splits = 433
Sep 25 01:57:37 ai-platform ollama[2194]: INFO [main] model loaded | tid="139687410753536" timestamp=1727254657
Sep 25 01:57:37 ai-platform ollama[2049]: time=2024-09-25T01:57:37.386-07:00 level=INFO source=server.go:626 msg="llama runner started in 153.74 seconds"
Sep 25 01:57:37 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:57:37 | 200 | 2m34s | 127.0.0.1 | POST "/api/generate"


@Blake110 commented on GitHub (Sep 25, 2024):

And here is a screenshot from nvtop; I loaded llama3.1:70b.
![Screenshot 2024-09-25](https://github.com/user-attachments/assets/1acc3609-d1c3-48dc-a3bf-ad0ee490752a)


@rick-github commented on GitHub (Sep 25, 2024):

Please post the full log; there is information earlier in the log that shows device detection.


@Blake110 commented on GitHub (Sep 25, 2024):

The full ollama logs are here. BTW, after pulling out the P40, the M6000 works with ollama under the same NVIDIA CUDA driver.

I also tried the RTX 2060 with the P40, and the 1080 Ti with the P40, under the same NVIDIA CUDA driver; they work together with no issues. So differing GPU architectures by themselves should be OK.

I used the NVIDIA-Linux-x86_64-550.100.run driver.

In log lines 352-354 (also 494-496), 2 GPUs were found:

352 Sep 25 13:09:20 ai-platform ollama[863]: time=2024-09-25T13:09:20.795-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
353 Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
354 Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
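Note that the two detection lines above report different runner variants: `variant=v12` for the P40 (compute 6.1) and `variant=v11` for the M6000 (compute 5.2), while the log later shows a single `/runners/cuda_v12/ollama_llama_server` process being launched. A minimal sketch of that reading, parsing the two log lines (the "one runner variant per load" grouping is my assumption about the behavior, not ollama's actual scheduler code):

```python
import re

# The two "inference compute" lines copied (abridged) from the log above.
log_lines = [
    'msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40"',
    'msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB"',
]

def parse(line):
    """Pull the runner variant, compute capability, and name out of one line."""
    variant = re.search(r'variant=(\S+)', line).group(1)
    compute = float(re.search(r'compute=(\S+)', line).group(1))
    name = re.search(r'name="([^"]+)"', line).group(1)
    return {"variant": variant, "compute": compute, "name": name}

gpus = [parse(line) for line in log_lines]

# Assumption: one runner binary is chosen per model load, so a GPU whose
# variant differs from the chosen runner is left out of the layer split.
chosen = "v12"  # the log shows /runners/cuda_v12/ollama_llama_server
usable = [g["name"] for g in gpus if g["variant"] == chosen]
print(usable)  # only the P40 matches the cuda_v12 runner
```

If that reading is right, it would explain why ggml_cuda_init reports a single device even though both cards are detected.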

And in lines 719-720, only the P40 was found by ggml_cuda.

719 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
720 Sep 25 13:57:37 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
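One workaround that may be worth trying, based on the `OLLAMA_LLM_LIBRARY` knob visible in the server config dump in the log (this is an assumption, not a confirmed fix): force the cuda_v11 runner, which is the variant the M6000 was assigned, via a systemd drop-in, at the cost of the P40 also using the v11 build.

```
# /etc/systemd/system/ollama.service.d/override.conf
# Hypothetical workaround: force both GPUs onto the same runner build.
[Service]
Environment="OLLAMA_LLM_LIBRARY=cuda_v11"
```

After `systemctl daemon-reload && systemctl restart ollama`, the `ggml_cuda_init: found N CUDA devices` line should report both cards if the runner-variant mismatch is indeed the cause.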

 1	Sep 25 01:54:13 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:13 | 200 |      43.216µs |       127.0.0.1 | HEAD     "/"
 2	Sep 25 01:54:13 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:13 | 200 |  164.080988ms |       127.0.0.1 | GET      "/api/tags"
 3	Sep 25 01:54:21 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:21 | 200 |        22.7µs |       127.0.0.1 | HEAD     "/"
 4	Sep 25 01:54:21 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:21 | 200 |      83.877µs |       127.0.0.1 | GET      "/api/ps"
 5	Sep 25 01:54:49 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:49 | 200 |      24.045µs |       127.0.0.1 | HEAD     "/"
 6	Sep 25 01:54:49 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:49 | 200 |    1.135969ms |       127.0.0.1 | GET      "/api/tags"
 7	Sep 25 01:55:03 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:55:03 | 200 |      22.533µs |       127.0.0.1 | HEAD     "/"
 8	Sep 25 01:55:03 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:55:03 | 200 |  186.942878ms |       127.0.0.1 | POST     "/api/show"
 9	Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.644-07:00 level=INFO source=server.go:103 msg="system memory" total="78.5 GiB" free="77.2 GiB" free_swap="8.0 GiB"
10	Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.645-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
11	Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2509469182/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 37095"
12	Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
13	Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
14	Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.648-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
15	Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] build info | build=10 commit="9225b05" tid="139687410753536" timestamp=1727254503
16	Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139687410753536" timestamp=1727254503 total_threads=14
17	Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="37095" tid="139687410753536" timestamp=1727254503
18	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
19	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
20	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   0:                       general.architecture str              = llama
21	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   1:                               general.type str              = model
22	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 70B Instruct
23	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
24	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
25	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   5:                         general.size_label str              = 70B
26	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
27	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
28	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
29	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv   9:                          llama.block_count u32              = 80
30	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
31	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 8192
32	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 28672
33	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 64
34	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
35	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
36	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
37	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  17:                          general.file_type u32              = 2
38	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
39	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
40	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
41	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
42	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
43	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
44	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
45	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
46	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
47	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
48	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
49	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type  f32:  162 tensors
50	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type q4_0:  561 tensors
51	Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type q6_K:    1 tensors
52	Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.899-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
53	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_vocab: special tokens cache size = 256
54	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_vocab: token to piece cache size = 0.7999 MB
55	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: format           = GGUF V3 (latest)
56	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: arch             = llama
57	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: vocab type       = BPE
58	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_vocab          = 128256
59	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_merges         = 280147
60	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: vocab_only       = 0
61	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ctx_train      = 131072
62	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd           = 8192
63	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_layer          = 80
64	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_head           = 64
65	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_head_kv        = 8
66	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_rot            = 128
67	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_swa            = 0
68	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_head_k    = 128
69	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_head_v    = 128
70	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_gqa            = 8
71	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_k_gqa     = 1024
72	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_v_gqa     = 1024
73	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_norm_eps       = 0.0e+00
74	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
75	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
76	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
77	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_logit_scale    = 0.0e+00
78	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ff             = 28672
79	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_expert         = 0
80	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_expert_used    = 0
81	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: causal attn      = 1
82	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: pooling type     = 0
83	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope type        = 0
84	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope scaling     = linear
85	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: freq_base_train  = 500000.0
86	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: freq_scale_train = 1
87	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
88	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope_finetuned   = unknown
89	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_conv       = 0
90	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_inner      = 0
91	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_state      = 0
92	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_dt_rank      = 0
93	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
94	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model type       = 70B
95	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model ftype      = Q4_0
96	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model params     = 70.55 B
97	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model size       = 37.22 GiB (4.53 BPW)
98	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: general.name     = Meta Llama 3.1 70B Instruct
99	Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'

100 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
101 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: LF token = 128 'Ä'
102 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
103 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: max token length = 256
104 Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
105 Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
106 Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: found 1 CUDA devices:
107 Sep 25 01:55:04 ai-platform ollama[2049]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
108 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_tensors: ggml ctx size = 0.68 MiB
109 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloading 47 repeating layers to GPU
110 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloaded 47/81 layers to GPU
111 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CPU buffer size = 38110.61 MiB
112 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
113 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ctx = 2048
114 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_batch = 512
115 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ubatch = 512
116 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: flash_attn = 0
117 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_base = 500000.0
118 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_scale = 1
119 Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
120 Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
121 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
122 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
123 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
124 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
125 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph nodes = 2566
126 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph splits = 433
127 Sep 25 01:57:37 ai-platform ollama[2194]: INFO [main] model loaded | tid="139687410753536" timestamp=1727254657
128 Sep 25 01:57:37 ai-platform ollama[2049]: time=2024-09-25T01:57:37.386-07:00 level=INFO source=server.go:626 msg="llama runner started in 153.74 seconds"
129 Sep 25 01:57:37 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:57:37 | 200 | 2m34s | 127.0.0.1 | POST "/api/generate"
130 Sep 25 02:08:35 ai-platform systemd[1]: Stopping Ollama Service...
131 Sep 25 02:08:36 ai-platform systemd[1]: ollama.service: Deactivated successfully.
132 Sep 25 02:08:36 ai-platform systemd[1]: Stopped Ollama Service.
133 Sep 25 02:08:36 ai-platform systemd[1]: ollama.service: Consumed 1min 4.919s CPU time.
134 -- Boot 997666db52994643b5bfc4ed04149e37 --
135 Sep 25 02:10:03 ai-platform systemd[1]: Started Ollama Service.
136 Sep 25 02:10:03 ai-platform ollama[846]: 2024/09/25 02:10:03 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
137 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.720-07:00 level=INFO source=images.go:753 msg="total blobs: 33"
138 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.729-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
139 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.730-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
140 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.732-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama4199965130/runners
141 Sep 25 02:10:19 ai-platform ollama[846]: time=2024-09-25T02:10:19.610-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
142 Sep 25 02:10:19 ai-platform ollama[846]: time=2024-09-25T02:10:19.612-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
143 Sep 25 02:10:21 ai-platform ollama[846]: time=2024-09-25T02:10:21.299-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
144 Sep 25 02:10:21 ai-platform ollama[846]: time=2024-09-25T02:10:21.299-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
145 Sep 25 02:12:57 ai-platform ollama[846]: [GIN] 2024/09/25 - 02:12:57 | 200 | 996.696µs | 127.0.0.1 | HEAD "/"
146 Sep 25 02:12:57 ai-platform ollama[846]: [GIN] 2024/09/25 - 02:12:57 | 200 | 180.045µs | 127.0.0.1 | GET "/api/ps"
147 Sep 25 02:13:13 ai-platform systemd[1]: Stopping Ollama Service...
148 Sep 25 02:13:14 ai-platform systemd[1]: ollama.service: Deactivated successfully.
149 Sep 25 02:13:14 ai-platform systemd[1]: Stopped Ollama Service.
150 Sep 25 02:13:14 ai-platform systemd[1]: ollama.service: Consumed 30.152s CPU time.
151 -- Boot 37f48c8c97cd44a0bb24888eb055fc69 --
152 Sep 25 02:17:12 ai-platform systemd[1]: Started Ollama Service.
153 Sep 25 02:17:15 ai-platform ollama[869]: 2024/09/25 02:17:15 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
154 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.092-07:00 level=INFO source=images.go:753 msg="total blobs: 33"
155 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.165-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
156 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.170-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
157 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.171-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama4047483176/runners
158 Sep 25 02:17:59 ai-platform ollama[869]: time=2024-09-25T02:17:59.819-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu]"
159 Sep 25 02:17:59 ai-platform ollama[869]: time=2024-09-25T02:17:59.820-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
160 Sep 25 02:18:01 ai-platform ollama[869]: time=2024-09-25T02:18:01.531-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
161 Sep 25 02:18:01 ai-platform ollama[869]: time=2024-09-25T02:18:01.532-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
162 Sep 25 02:23:45 ai-platform systemd[1]: Stopping Ollama Service...
163 Sep 25 02:23:46 ai-platform systemd[1]: ollama.service: Deactivated successfully.
164 Sep 25 02:23:46 ai-platform systemd[1]: Stopped Ollama Service.
165 Sep 25 02:23:46 ai-platform systemd[1]: ollama.service: Consumed 30.855s CPU time.
166 -- Boot 573ea622850a4a3d8eb2de36dd38cae3 --
167 Sep 25 02:25:05 ai-platform systemd[1]: Started Ollama Service.
168 Sep 25 02:25:07 ai-platform ollama[864]: 2024/09/25 02:25:07 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
169 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.064-07:00 level=INFO source=images.go:753 msg="total blobs: 33"
170 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.803-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
171 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.810-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
172 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.817-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2283343905/runners
173 Sep 25 02:25:51 ai-platform ollama[864]: time=2024-09-25T02:25:51.063-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu cpu_avx]"
174 Sep 25 02:25:51 ai-platform ollama[864]: time=2024-09-25T02:25:51.070-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
175 Sep 25 02:25:52 ai-platform ollama[864]: time=2024-09-25T02:25:52.804-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
176 Sep 25 02:25:52 ai-platform ollama[864]: time=2024-09-25T02:25:52.804-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
177 Sep 25 02:49:48 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:49:48 | 200 | 806.784µs | 127.0.0.1 | HEAD "/"
178 Sep 25 02:49:48 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:49:48 | 200 | 14.434265ms | 127.0.0.1 | GET "/api/tags"
179 Sep 25 02:55:02 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:02 | 200 | 37.488µs | 127.0.0.1 | HEAD "/"
180 Sep 25 02:55:02 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:02 | 200 | 79.297887ms | 127.0.0.1 | DELETE "/api/delete"
181 Sep 25 02:55:07 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:07 | 200 | 35.511µs | 127.0.0.1 | HEAD "/"
182 Sep 25 02:55:07 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:07 | 200 | 1.096253ms | 127.0.0.1 | GET "/api/tags"
183 Sep 25 02:55:25 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:25 | 200 | 24.964µs | 127.0.0.1 | HEAD "/"
184 Sep 25 02:55:26 ai-platform ollama[864]: time=2024-09-25T02:55:26.794-07:00 level=INFO source=download.go:175 msg="downloading 09cd6813dc2e in 17 1 GB part(s)"
185 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 10 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
186 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 15 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
187 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 9 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
188 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 7 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
189 Sep 25 02:58:03 ai-platform ollama[864]: time=2024-09-25T02:58:03.626-07:00 level=INFO source=download.go:175 msg="downloading 948af2743fc7 in 1 1.5 KB part(s)"
190 Sep 25 02:58:05 ai-platform ollama[864]: time=2024-09-25T02:58:05.536-07:00 level=INFO source=download.go:175 msg="downloading daa7d15f6d0b in 1 484 B part(s)"
191 Sep 25 02:58:53 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:58:53 | 200 | 3m28s | 127.0.0.1 | POST "/api/pull"
192 Sep 25 03:00:30 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:30 | 200 | 28.578µs | 127.0.0.1 | HEAD "/"
193 Sep 25 03:00:30 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:30 | 200 | 1.274784ms | 127.0.0.1 | GET "/api/tags"
194 Sep 25 03:00:49 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:49 | 200 | 22.555µs | 127.0.0.1 | HEAD "/"
195 Sep 25 03:00:49 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:49 | 404 | 135.621µs | 127.0.0.1 | POST "/api/show"
196 Sep 25 03:00:50 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:50 | 200 | 468.40178ms | 127.0.0.1 | POST "/api/pull"
197 Sep 25 03:01:06 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:06 | 200 | 27.667µs | 127.0.0.1 | HEAD "/"
198 Sep 25 03:01:06 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:06 | 200 | 26.493541ms | 127.0.0.1 | POST "/api/show"
199 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.425-07:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 gpu=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea parallel=4 available=25470566400 required="16.4 GiB"
200 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.425-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.5 GiB" free_swap="8.0 GiB"
Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.426-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="16.4 GiB" memory.required.partial="16.4 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[16.4 GiB]" memory.weights.total="14.0 GiB" memory.weights.repeating="13.0 GiB" memory.weights.nonrepeating="1002.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.428-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2283343905/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 34317"
Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.429-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.429-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.429-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Sep 25 03:01:07 ai-platform ollama[1805]: INFO [main] build info | build=10 commit="9225b05" tid="140325808095232" timestamp=1727258467
Sep 25 03:01:07 ai-platform ollama[1805]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140325808095232" timestamp=1727258467 total_threads=14
Sep 25 03:01:07 ai-platform ollama[1805]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="34317" tid="140325808095232" timestamp=1727258467
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 (version GGUF V3 (latest))
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 1: general.type str = model
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 5: general.size_label str = 8B
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 9: llama.block_count u32 = 32
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 17: general.file_type u32 = 1
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 25 03:01:07 ai-platform ollama[864]: time=2024-09-25T03:01:07.685-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - type f32: 66 tensors
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - type f16: 226 tensors
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_vocab: special tokens cache size = 256
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: arch = llama
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: vocab type = BPE
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_vocab = 128256
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_merges = 280147
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: vocab_only = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_ctx_train = 131072
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd = 4096
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_layer = 32
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_head = 32
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_head_kv = 8
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_rot = 128
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_swa = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_head_k = 128
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_head_v = 128
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_gqa = 4
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_ff = 14336
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_expert = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_expert_used = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: causal attn = 1
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: pooling type = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: rope type = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: rope scaling = linear
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: freq_base_train = 500000.0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: freq_scale_train = 1
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: rope_finetuned = unknown
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_d_conv = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_d_inner = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_d_state = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_dt_rank = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model type = 8B
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model ftype = F16
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model params = 8.03 B
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model size = 14.96 GiB (16.00 BPW)
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: LF token = 128 'Ä'
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: max token length = 256
Sep 25 03:01:08 ai-platform ollama[864]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 25 03:01:08 ai-platform ollama[864]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 25 03:01:08 ai-platform ollama[864]: ggml_cuda_init: found 1 CUDA devices:
Sep 25 03:01:08 ai-platform ollama[864]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_tensors: ggml ctx size = 0.27 MiB
Sep 25 03:01:09 ai-platform ollama[864]: time=2024-09-25T03:01:09.140-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: CPU buffer size = 1002.00 MiB
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: CUDA0 buffer size = 14315.02 MiB
Sep 25 03:01:09 ai-platform ollama[864]: time=2024-09-25T03:01:09.843-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: n_ctx = 8192
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: n_batch = 512
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: n_ubatch = 512
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: flash_attn = 0
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: freq_base = 500000.0
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: freq_scale = 1
Sep 25 03:01:11 ai-platform ollama[864]: llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: graph nodes = 1030
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: graph splits = 2
Sep 25 03:01:12 ai-platform ollama[1805]: INFO [main] model loaded | tid="140325808095232" timestamp=1727258472
Sep 25 03:01:12 ai-platform ollama[864]: time=2024-09-25T03:01:12.354-07:00 level=INFO source=server.go:626 msg="llama runner started in 5.93 seconds"
Sep 25 03:01:12 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:12 | 200 | 6.266740535s | 127.0.0.1 | POST "/api/generate"
Sep 25 03:01:26 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:26 | 200 | 1.650481207s | 127.0.0.1 | POST "/api/chat"
Sep 25 03:02:29 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:02:29 | 200 | 27.755757066s | 127.0.0.1 | POST "/api/chat"
Sep 25 03:05:31 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:05:31 | 200 | 34.507382868s | 127.0.0.1 | POST "/api/chat"
Sep 25 06:26:33 ai-platform systemd[1]: Stopping Ollama Service...
Sep 25 06:26:34 ai-platform systemd[1]: ollama.service: Deactivated successfully.
Sep 25 06:26:34 ai-platform systemd[1]: Stopped Ollama Service.
Sep 25 06:26:34 ai-platform systemd[1]: ollama.service: Consumed 6min 21.822s CPU time.
-- Boot 07e33ef45ce6476f8795bb10410b0122 --
Sep 25 12:42:32 ai-platform systemd[1]: Started Ollama Service.
Sep 25 12:42:32 ai-platform ollama[861]: 2024/09/25 12:42:32 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 25 12:42:32 ai-platform ollama[861]: time=2024-09-25T12:42:32.988-07:00 level=INFO source=images.go:753 msg="total blobs: 31"
Sep 25 12:42:32 ai-platform ollama[861]: time=2024-09-25T12:42:32.997-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 25 12:42:32 ai-platform ollama[861]: time=2024-09-25T12:42:32.999-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
Sep 25 12:42:33 ai-platform ollama[861]: time=2024-09-25T12:42:33.003-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1762789208/runners
Sep 25 12:42:48 ai-platform ollama[861]: time=2024-09-25T12:42:48.061-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
Sep 25 12:42:48 ai-platform ollama[861]: time=2024-09-25T12:42:48.069-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
Sep 25 12:42:50 ai-platform ollama[861]: time=2024-09-25T12:42:50.306-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
Sep 25 12:42:50 ai-platform ollama[861]: time=2024-09-25T12:42:50.306-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
Sep 25 13:07:38 ai-platform systemd[1]: Stopping Ollama Service...
Sep 25 13:07:40 ai-platform systemd[1]: ollama.service: Deactivated successfully.
Sep 25 13:07:40 ai-platform systemd[1]: Stopped Ollama Service.
Sep 25 13:07:40 ai-platform systemd[1]: ollama.service: Consumed 29.339s CPU time.
-- Boot 24c5cad9e4db4be8951d9cf2bc3114c5 --
Sep 25 13:08:55 ai-platform systemd[1]: Started Ollama Service.
Sep 25 13:08:59 ai-platform ollama[863]: 2024/09/25 13:08:59 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.497-07:00 level=INFO source=images.go:753 msg="total blobs: 31"
Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.546-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.547-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.547-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2555802688/runners
Sep 25 13:09:20 ai-platform ollama[863]: time=2024-09-25T13:09:20.794-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu]"
Sep 25 13:09:20 ai-platform ollama[863]: time=2024-09-25T13:09:20.795-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
Sep 25 13:15:35 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:35 | 200 | 28.717µs | 127.0.0.1 | HEAD "/"
Sep 25 13:15:35 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:35 | 200 | 11.959577ms | 127.0.0.1 | GET "/api/tags"
Sep 25 13:15:57 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:57 | 200 | 24.363µs | 127.0.0.1 | HEAD "/"
Sep 25 13:15:57 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:57 | 200 | 160.361228ms | 127.0.0.1 | POST "/api/show"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.207-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.6 GiB" free_swap="8.0 GiB"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.208-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.210-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2555802688/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 33157"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.211-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.211-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.211-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Sep 25 13:15:59 ai-platform ollama[1836]: INFO [main] build info | build=10 commit="9225b05" tid="139726698635264" timestamp=1727295359
Sep 25 13:15:59 ai-platform ollama[1836]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139726698635264" timestamp=1727295359 total_threads=14
Sep 25 13:15:59 ai-platform ollama[1836]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="33157" tid="139726698635264" timestamp=1727295359
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 1: general.type str = model
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 5: general.size_label str = 70B
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 9: llama.block_count u32 = 80
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 17: general.file_type u32 = 2
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 25 13:15:59 ai-platform ollama[863]: time=2024-09-25T13:15:59.717-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - type f32: 162 tensors
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - type q4_0: 561 tensors
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - type q6_K: 1 tensors
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_vocab: special tokens cache size = 256
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: arch = llama
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: vocab type = BPE
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_vocab = 128256
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_merges = 280147
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: vocab_only = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_ctx_train = 131072
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd = 8192
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_layer = 80
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_head = 64
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_head_kv = 8
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_rot = 128
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_swa = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_k = 128
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_v = 128
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_gqa = 8
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_ff = 28672
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_expert = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_expert_used = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: causal attn = 1
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: pooling type = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: rope type = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: rope scaling = linear
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: freq_base_train = 500000.0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: freq_scale_train = 1
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: rope_finetuned = unknown
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_d_conv = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_d_inner = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_d_state = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_rank = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model type = 70B
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model ftype = Q4_0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model params = 70.55 B
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: LF token = 128 'Ä'
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: max token length = 256
Sep 25 13:16:00 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 25 13:16:00 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 25 13:16:00 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
Sep 25 13:16:00 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_tensors: ggml ctx size = 0.68 MiB
Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: offloading 47 repeating layers to GPU
Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: offloaded 47/81 layers to GPU
Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: CPU buffer size = 38110.61 MiB
Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: n_ctx = 2048
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: n_batch = 512
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: n_ubatch = 512
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: flash_attn = 0
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: freq_base = 500000.0
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: freq_scale = 1
Sep 25 13:20:06 ai-platform ollama[863]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: graph nodes = 2566
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: graph splits = 433
Sep 25 13:20:07 ai-platform ollama[1836]: INFO [main] model loaded | tid="139726698635264" timestamp=1727295607
Sep 25 13:20:07 ai-platform ollama[863]: time=2024-09-25T13:20:07.942-07:00 level=INFO source=server.go:626 msg="llama runner started in 249.73 seconds"
Sep 25 13:20:07 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:20:07 | 200 | 4m10s | 127.0.0.1 | POST "/api/generate"
Sep 25 13:21:49 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:21:49 | 200 | 8.160347761s | 127.0.0.1 | POST "/api/chat"
Sep 25 13:22:11 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:22:11 | 200 | 14.763904056s | 127.0.0.1 | POST "/api/chat"
Sep 25 13:27:04 ai-platform systemd[1]: Stopping Ollama Service...
Sep 25 13:27:07 ai-platform systemd[1]: ollama.service: Deactivated successfully.
Sep 25 13:27:07 ai-platform systemd[1]: Stopped Ollama Service.
Sep 25 13:27:07 ai-platform systemd[1]: ollama.service: Consumed 5min 33.070s CPU time.
-- Boot 16c6f123db1c41d89aa8afa1dcd6c4fc --
Sep 25 13:28:14 ai-platform systemd[1]: Started Ollama Service.
Sep 25 13:28:16 ai-platform ollama[863]: 2024/09/25 13:28:16 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
489 Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.609-07:00 level=INFO source=images.go:753 msg="total blobs: 31"
490 Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.666-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
491 Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.668-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
492 Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.668-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2431661027/runners
493 Sep 25 13:28:46 ai-platform ollama[863]: time=2024-09-25T13:28:46.470-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
494 Sep 25 13:28:46 ai-platform ollama[863]: time=2024-09-25T13:28:46.470-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
495 Sep 25 13:28:48 ai-platform ollama[863]: time=2024-09-25T13:28:48.635-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
496 Sep 25 13:28:48 ai-platform ollama[863]: time=2024-09-25T13:28:48.635-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
497 Sep 25 13:45:49 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:45:49 | 200 | 1.00897ms | 127.0.0.1 | HEAD "/"
498 Sep 25 13:45:49 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:45:49 | 200 | 14.162083ms | 127.0.0.1 | GET "/api/tags"
499 Sep 25 13:46:01 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:46:01 | 200 | 40.385µs | 127.0.0.1 | HEAD "/"
500 Sep 25 13:46:01 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:46:01 | 200 | 549.982929ms | 127.0.0.1 | POST "/api/show"
501 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.049-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.6 GiB" free_swap="8.0 GiB"
502 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.051-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
503 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.052-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2431661027/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 40727"
504 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.053-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
505 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.053-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
506 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.053-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
507 Sep 25 13:46:03 ai-platform ollama[1842]: INFO [main] build info | build=10 commit="9225b05" tid="140313977905152" timestamp=1727297163
508 Sep 25 13:46:03 ai-platform ollama[1842]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140313977905152" timestamp=1727297163 total_threads=14
509 Sep 25 13:46:03 ai-platform ollama[1842]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="40727" tid="140313977905152" timestamp=1727297163
510 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
511 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
512 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 0: general.architecture str = llama
513 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 1: general.type str = model
514 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
515 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 3: general.finetune str = Instruct
516 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
517 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 5: general.size_label str = 70B
518 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 6: general.license str = llama3.1
519 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
520 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
521 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 9: llama.block_count u32 = 80
522 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
523 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
524 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
525 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
526 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
527 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
528 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
529 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 17: general.file_type u32 = 2
530 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
531 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
532 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
533 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
534 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
535 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
536 Sep 25 13:46:03 ai-platform ollama[863]: time=2024-09-25T13:46:03.308-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
537 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
538 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
539 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
540 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
541 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
542 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - type f32: 162 tensors
543 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - type q4_0: 561 tensors
544 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - type q6_K: 1 tensors
545 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_vocab: special tokens cache size = 256
546 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_vocab: token to piece cache size = 0.7999 MB
547 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: format = GGUF V3 (latest)
548 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: arch = llama
549 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: vocab type = BPE
550 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_vocab = 128256
551 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_merges = 280147
552 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: vocab_only = 0
553 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_ctx_train = 131072
554 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd = 8192
555 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_layer = 80
556 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_head = 64
557 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_head_kv = 8
558 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_rot = 128
559 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_swa = 0
560 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_k = 128
561 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_v = 128
562 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_gqa = 8
563 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_k_gqa = 1024
564 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_v_gqa = 1024
565 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_norm_eps = 0.0e+00
566 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
567 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
568 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
569 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_logit_scale = 0.0e+00
570 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_ff = 28672
571 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_expert = 0
572 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_expert_used = 0
573 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: causal attn = 1
574 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: pooling type = 0
575 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: rope type = 0
576 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: rope scaling = linear
577 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: freq_base_train = 500000.0
578 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: freq_scale_train = 1
579 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_ctx_orig_yarn = 131072
580 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: rope_finetuned = unknown
581 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_d_conv = 0
582 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_d_inner = 0
583 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_d_state = 0
584 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_rank = 0
585 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_b_c_rms = 0
586 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model type = 70B
587 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model ftype = Q4_0
588 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model params = 70.55 B
589 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
590 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
591 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
592 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
593 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: LF token = 128 'Ä'
594 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
595 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: max token length = 256
596 Sep 25 13:46:03 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
597 Sep 25 13:46:03 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
598 Sep 25 13:46:03 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
599 Sep 25 13:46:03 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
600 Sep 25 13:46:04 ai-platform ollama[863]: llm_load_tensors: ggml ctx size = 0.68 MiB
601 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: offloading 47 repeating layers to GPU
602 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: offloaded 47/81 layers to GPU
603 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: CPU buffer size = 38110.61 MiB
604 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
605 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: n_ctx = 2048
606 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: n_batch = 512
607 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: n_ubatch = 512
608 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: flash_attn = 0
609 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: freq_base = 500000.0
610 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: freq_scale = 1
611 Sep 25 13:48:56 ai-platform ollama[863]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
612 Sep 25 13:48:56 ai-platform ollama[863]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
613 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
614 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
615 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
616 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
617 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: graph nodes = 2566
618 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: graph splits = 433
619 Sep 25 13:48:57 ai-platform ollama[1842]: INFO [main] model loaded | tid="140313977905152" timestamp=1727297337
620 Sep 25 13:48:57 ai-platform ollama[863]: time=2024-09-25T13:48:57.670-07:00 level=INFO source=server.go:626 msg="llama runner started in 175.62 seconds"
621 Sep 25 13:48:57 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:48:57 | 200 | 2m56s | 127.0.0.1 | POST "/api/generate"
622 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.808-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.6 GiB" free_swap="8.0 GiB"
623 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.809-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
624 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2431661027/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 44643"
625 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
626 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
627 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
628 Sep 25 13:57:36 ai-platform ollama[2354]: INFO [main] build info | build=10 commit="9225b05" tid="139803936231424" timestamp=1727297856
629 Sep 25 13:57:36 ai-platform ollama[2354]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139803936231424" timestamp=1727297856 total_threads=14
630 Sep 25 13:57:36 ai-platform ollama[2354]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="44643" tid="139803936231424" timestamp=1727297856
631 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
632 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
633 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 0: general.architecture str = llama
634 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 1: general.type str = model
635 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
636 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 3: general.finetune str = Instruct
637 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
638 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 5: general.size_label str = 70B
639 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 6: general.license str = llama3.1
640 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
641 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
642 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 9: llama.block_count u32 = 80
643 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
644 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
645 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
646 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
647 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
648 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
649 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
650 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 17: general.file_type u32 = 2
651 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
652 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
653 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
654 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
655 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
656 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
657 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
658 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
659 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
660 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
661 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
662 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - type f32: 162 tensors
663 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - type q4_0: 561 tensors
664 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - type q6_K: 1 tensors
665 Sep 25 13:57:37 ai-platform ollama[863]: time=2024-09-25T13:57:37.063-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
666 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_vocab: special tokens cache size = 256
667 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_vocab: token to piece cache size = 0.7999 MB
668 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: format = GGUF V3 (latest)
669 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: arch = llama
670 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: vocab type = BPE
671 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_vocab = 128256
672 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_merges = 280147
673 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: vocab_only = 0
674 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_ctx_train = 131072
675 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd = 8192
676 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_layer = 80
677 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_head = 64
678 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_head_kv = 8
679 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_rot = 128
680 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_swa = 0
681 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_k = 128
682 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_v = 128
683 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_gqa = 8
684 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_k_gqa = 1024
685 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_v_gqa = 1024
686 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_norm_eps = 0.0e+00
687 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
688 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
689 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
690 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_logit_scale = 0.0e+00
691 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_ff = 28672
692 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_expert = 0
693 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_expert_used = 0
694 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: causal attn = 1
695 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: pooling type = 0
696 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: rope type = 0
697 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: rope scaling = linear
698 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: freq_base_train = 500000.0
699 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: freq_scale_train = 1
700 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_ctx_orig_yarn = 131072
701 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: rope_finetuned = unknown
702 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_d_conv = 0
703 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_d_inner = 0
704 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_d_state = 0
705 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_rank = 0
706 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_b_c_rms = 0
707 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model type = 70B
708 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model ftype = Q4_0
709 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model params = 70.55 B
710 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
711 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
712 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
713 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
714 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: LF token = 128 'Ä'
715 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
716 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: max token length = 256
717 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
718 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
719 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
720 Sep 25 13:57:37 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
721 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_tensors: ggml ctx size = 0.68 MiB
722 Sep 25 13:57:38 ai-platform ollama[863]: time=2024-09-25T13:57:38.519-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
723 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: offloading 47 repeating layers to GPU
724 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: offloaded 47/81 layers to GPU
725 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: CPU buffer size = 38110.61 MiB
726 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
727 Sep 25 13:57:39 ai-platform ollama[863]: time=2024-09-25T13:57:39.221-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
728 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: n_ctx = 2048
729 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: n_batch = 512
730 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: n_ubatch = 512
731 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: flash_attn = 0
732 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: freq_base = 500000.0
733 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: freq_scale = 1
734 Sep 25 13:57:41 ai-platform ollama[863]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
735 Sep 25 13:57:41 ai-platform ollama[863]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
736 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
737 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
738 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
739 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
740 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: graph nodes = 2566
741 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: graph splits = 433
742 Sep 25 13:57:42 ai-platform ollama[2354]: INFO [main] model loaded | tid="139803936231424" timestamp=1727297862
743 Sep 25 13:57:42 ai-platform ollama[863]: time=2024-09-25T13:57:42.760-07:00 level=INFO source=server.go:626 msg="llama runner started in 5.95 seconds"
744 Sep 25 13:57:50 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:57:50 | 200 | 14.16960451s | 127.0.0.1 | POST "/api/chat"

<!-- gh-comment-id:2375273573 --> @Blake110 commented on GitHub (Sep 25, 2024):

All the ollama logs are here. By the way, after pulling out the P40, the M6000 works with ollama under the same NVIDIA CUDA driver. I also tried the RTX 2060 with the P40, and the 1080 Ti with the P40, on the same driver, and they work together with no issues, so a difference in GPU architecture by itself should be fine. I used the NVIDIA-Linux-x86_64-550.100.run driver.

In lines 352-354 (and again in lines 494-496), both GPUs were found:

352 Sep 25 13:09:20 ai-platform ollama[863]: time=2024-09-25T13:09:20.795-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
353 Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
354 Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"

But in lines 719-720, only the P40 was found by ggml_cuda:

719 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
720 Sep 25 13:57:37 ai-platform ollama[863]:   Device 0: Tesla P40, compute capability 6.1, VMM: yes

1 Sep 25 01:54:13 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:13 | 200 | 43.216µs | 127.0.0.1 | HEAD "/"
2 Sep 25 01:54:13 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:13 | 200 | 164.080988ms | 127.0.0.1 | GET "/api/tags"
3 Sep 25 01:54:21 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:21 | 200 | 22.7µs | 127.0.0.1 | HEAD "/"
4 Sep 25 01:54:21 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:21 | 200 | 83.877µs | 127.0.0.1 | GET "/api/ps"
5 Sep 25 01:54:49 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:49 | 200 | 24.045µs | 127.0.0.1 | HEAD "/"
6 Sep 25 01:54:49 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:49 | 200 | 1.135969ms | 127.0.0.1 | GET "/api/tags"
7 Sep 25 01:55:03 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:55:03 | 200 | 22.533µs | 127.0.0.1 | HEAD "/"
8 Sep 25 01:55:03 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:55:03 | 200 | 186.942878ms | 127.0.0.1 | POST "/api/show"
9 Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.644-07:00 level=INFO source=server.go:103 msg="system memory" total="78.5 GiB" free="77.2 GiB" free_swap="8.0 GiB"
10 Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.645-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
11 Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2509469182/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 37095"
12 Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
13 Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
14 Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.648-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
15 Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] build info | build=10 commit="9225b05" tid="139687410753536" timestamp=1727254503
16 Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139687410753536" timestamp=1727254503 total_threads=14
17 Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="37095" tid="139687410753536" timestamp=1727254503
18 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
19 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
20 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 0: general.architecture str = llama 21 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 1: general.type str = model 22 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct 23 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 3: general.finetune str = Instruct 24 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1 25 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 5: general.size_label str = 70B 26 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 6: general.license str = llama3.1 27 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... 28 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... 
29 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 9: llama.block_count u32 = 80 30 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 10: llama.context_length u32 = 131072 31 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192 32 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672 33 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64 34 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8 35 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000 36 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 37 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 17: general.file_type u32 = 2 38 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 39 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 40 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 41 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe 42 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... 43 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 44 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
45 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 46 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 47 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 48 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 49 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type f32: 162 tensors 50 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type q4_0: 561 tensors 51 Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type q6_K: 1 tensors 52 Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.899-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" 53 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_vocab: special tokens cache size = 256 54 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_vocab: token to piece cache size = 0.7999 MB 55 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: format = GGUF V3 (latest) 56 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: arch = llama 57 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: vocab type = BPE 58 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_vocab = 128256 59 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_merges = 280147 60 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: vocab_only = 0 61 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ctx_train = 131072 62 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd = 8192 63 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_layer = 80 64 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_head = 64 65 Sep 25 01:55:04 ai-platform 
ollama[2049]: llm_load_print_meta: n_head_kv = 8 66 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_rot = 128 67 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_swa = 0 68 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_head_k = 128 69 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_head_v = 128 70 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_gqa = 8 71 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_k_gqa = 1024 72 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_v_gqa = 1024 73 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_norm_eps = 0.0e+00 74 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05 75 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 76 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 77 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_logit_scale = 0.0e+00 78 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ff = 28672 79 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_expert = 0 80 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_expert_used = 0 81 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: causal attn = 1 82 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: pooling type = 0 83 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope type = 0 84 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope scaling = linear 85 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: freq_base_train = 500000.0 86 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: freq_scale_train = 1 87 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ctx_orig_yarn = 131072 88 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope_finetuned 
= unknown 89 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_conv = 0 90 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_inner = 0 91 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_state = 0 92 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_dt_rank = 0 93 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_dt_b_c_rms = 0 94 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model type = 70B 95 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model ftype = Q4_0 96 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model params = 70.55 B 97 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW) 98 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct 99 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' 100 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>' 101 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: LF token = 128 'Ä' 102 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>' 103 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: max token length = 256 104 Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 105 Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 106 Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: found 1 CUDA devices: 107 Sep 25 01:55:04 ai-platform ollama[2049]: Device 0: Tesla P40, compute capability 6.1, VMM: yes 108 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_tensors: ggml ctx size = 0.68 MiB 109 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloading 47 repeating layers to GPU 110 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloaded 47/81 
layers to GPU 111 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CPU buffer size = 38110.61 MiB 112 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB 113 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ctx = 2048 114 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_batch = 512 115 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ubatch = 512 116 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: flash_attn = 0 117 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_base = 500000.0 118 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_scale = 1 119 Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB 120 Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB 121 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB 122 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB 123 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB 124 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB 125 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph nodes = 2566 126 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph splits = 433 127 Sep 25 01:57:37 ai-platform ollama[2194]: INFO [main] model loaded | tid="139687410753536" timestamp=1727254657 128 Sep 25 01:57:37 ai-platform ollama[2049]: time=2024-09-25T01:57:37.386-07:00 level=INFO source=server.go:626 msg="llama runner started in 153.74 seconds" 129 Sep 25 01:57:37 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:57:37 | 200 
| 2m34s | 127.0.0.1 | POST "/api/generate" 130 Sep 25 02:08:35 ai-platform systemd[1]: Stopping Ollama Service... 131 Sep 25 02:08:36 ai-platform systemd[1]: ollama.service: Deactivated successfully. 132 Sep 25 02:08:36 ai-platform systemd[1]: Stopped Ollama Service. 133 Sep 25 02:08:36 ai-platform systemd[1]: ollama.service: Consumed 1min 4.919s CPU time. 134 -- Boot 997666db52994643b5bfc4ed04149e37 -- 135 Sep 25 02:10:03 ai-platform systemd[1]: Started Ollama Service. 136 Sep 25 02:10:03 ai-platform ollama[846]: 2024/09/25 02:10:03 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" 137 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.720-07:00 level=INFO source=images.go:753 msg="total blobs: 33" 138 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.729-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" 139 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.730-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)" 140 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.732-07:00 level=INFO source=common.go:135 msg="extracting embedded 
files" dir=/tmp/ollama4199965130/runners 141 Sep 25 02:10:19 ai-platform ollama[846]: time=2024-09-25T02:10:19.610-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]" 142 Sep 25 02:10:19 ai-platform ollama[846]: time=2024-09-25T02:10:19.612-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs" 143 Sep 25 02:10:21 ai-platform ollama[846]: time=2024-09-25T02:10:21.299-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB" 144 Sep 25 02:10:21 ai-platform ollama[846]: time=2024-09-25T02:10:21.299-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB" 145 Sep 25 02:12:57 ai-platform ollama[846]: [GIN] 2024/09/25 - 02:12:57 | 200 | 996.696µs | 127.0.0.1 | HEAD "/" 146 Sep 25 02:12:57 ai-platform ollama[846]: [GIN] 2024/09/25 - 02:12:57 | 200 | 180.045µs | 127.0.0.1 | GET "/api/ps" 147 Sep 25 02:13:13 ai-platform systemd[1]: Stopping Ollama Service... 148 Sep 25 02:13:14 ai-platform systemd[1]: ollama.service: Deactivated successfully. 149 Sep 25 02:13:14 ai-platform systemd[1]: Stopped Ollama Service. 150 Sep 25 02:13:14 ai-platform systemd[1]: ollama.service: Consumed 30.152s CPU time. 151 -- Boot 37f48c8c97cd44a0bb24888eb055fc69 -- 152 Sep 25 02:17:12 ai-platform systemd[1]: Started Ollama Service. 
153 Sep 25 02:17:15 ai-platform ollama[869]: 2024/09/25 02:17:15 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" 154 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.092-07:00 level=INFO source=images.go:753 msg="total blobs: 33" 155 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.165-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" 156 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.170-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)" 157 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.171-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama4047483176/runners 158 Sep 25 02:17:59 ai-platform ollama[869]: time=2024-09-25T02:17:59.819-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu]" 159 Sep 25 02:17:59 ai-platform ollama[869]: time=2024-09-25T02:17:59.820-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs" 160 Sep 25 02:18:01 ai-platform ollama[869]: time=2024-09-25T02:18:01.531-07:00 
level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB" 161 Sep 25 02:18:01 ai-platform ollama[869]: time=2024-09-25T02:18:01.532-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB" 162 Sep 25 02:23:45 ai-platform systemd[1]: Stopping Ollama Service... 163 Sep 25 02:23:46 ai-platform systemd[1]: ollama.service: Deactivated successfully. 164 Sep 25 02:23:46 ai-platform systemd[1]: Stopped Ollama Service. 165 Sep 25 02:23:46 ai-platform systemd[1]: ollama.service: Consumed 30.855s CPU time. 166 -- Boot 573ea622850a4a3d8eb2de36dd38cae3 -- 167 Sep 25 02:25:05 ai-platform systemd[1]: Started Ollama Service. 168 Sep 25 02:25:07 ai-platform ollama[864]: 2024/09/25 02:25:07 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" 169 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.064-07:00 level=INFO source=images.go:753 msg="total blobs: 
33" 170 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.803-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" 171 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.810-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)" 172 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.817-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2283343905/runners 173 Sep 25 02:25:51 ai-platform ollama[864]: time=2024-09-25T02:25:51.063-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu cpu_avx]" 174 Sep 25 02:25:51 ai-platform ollama[864]: time=2024-09-25T02:25:51.070-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs" 175 Sep 25 02:25:52 ai-platform ollama[864]: time=2024-09-25T02:25:52.804-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB" 176 Sep 25 02:25:52 ai-platform ollama[864]: time=2024-09-25T02:25:52.804-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB" 177 Sep 25 02:49:48 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:49:48 | 200 | 806.784µs | 127.0.0.1 | HEAD "/" 178 Sep 25 02:49:48 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:49:48 | 200 | 14.434265ms | 127.0.0.1 | GET "/api/tags" 179 Sep 25 02:55:02 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:02 | 200 | 37.488µs | 127.0.0.1 | HEAD "/" 180 Sep 25 02:55:02 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:02 | 200 | 79.297887ms | 127.0.0.1 | DELETE "/api/delete" 181 Sep 25 02:55:07 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:07 | 200 | 35.511µs | 127.0.0.1 | HEAD "/" 
182 Sep 25 02:55:07 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:07 | 200 | 1.096253ms | 127.0.0.1 | GET "/api/tags" 183 Sep 25 02:55:25 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:25 | 200 | 24.964µs | 127.0.0.1 | HEAD "/" 184 Sep 25 02:55:26 ai-platform ollama[864]: time=2024-09-25T02:55:26.794-07:00 level=INFO source=download.go:175 msg="downloading 09cd6813dc2e in 17 1 GB part(s)" 185 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 10 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection." 186 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 15 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection." 187 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 9 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection." 188 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 7 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection." 
189 Sep 25 02:58:03 ai-platform ollama[864]: time=2024-09-25T02:58:03.626-07:00 level=INFO source=download.go:175 msg="downloading 948af2743fc7 in 1 1.5 KB part(s)" 190 Sep 25 02:58:05 ai-platform ollama[864]: time=2024-09-25T02:58:05.536-07:00 level=INFO source=download.go:175 msg="downloading daa7d15f6d0b in 1 484 B part(s)" 191 Sep 25 02:58:53 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:58:53 | 200 | 3m28s | 127.0.0.1 | POST "/api/pull" 192 Sep 25 03:00:30 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:30 | 200 | 28.578µs | 127.0.0.1 | HEAD "/" 193 Sep 25 03:00:30 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:30 | 200 | 1.274784ms | 127.0.0.1 | GET "/api/tags" 194 Sep 25 03:00:49 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:49 | 200 | 22.555µs | 127.0.0.1 | HEAD "/" 195 Sep 25 03:00:49 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:49 | 404 | 135.621µs | 127.0.0.1 | POST "/api/show" 196 Sep 25 03:00:50 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:50 | 200 | 468.40178ms | 127.0.0.1 | POST "/api/pull" 197 Sep 25 03:01:06 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:06 | 200 | 27.667µs | 127.0.0.1 | HEAD "/" 198 Sep 25 03:01:06 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:06 | 200 | 26.493541ms | 127.0.0.1 | POST "/api/show" 199 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.425-07:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 gpu=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea parallel=4 available=25470566400 required="16.4 GiB" 200 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.425-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.5 GiB" free_swap="8.0 GiB" 201 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.426-07:00 level=INFO source=memory.go:326 msg="offload to cuda" 
layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="16.4 GiB" memory.required.partial="16.4 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[16.4 GiB]" memory.weights.total="14.0 GiB" memory.weights.repeating="13.0 GiB" memory.weights.nonrepeating="1002.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB" 202 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.428-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2283343905/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 34317" 203 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.429-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 204 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.429-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding" 205 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.429-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" 206 Sep 25 03:01:07 ai-platform ollama[1805]: INFO [main] build info | build=10 commit="9225b05" tid="140325808095232" timestamp=1727258467 207 Sep 25 03:01:07 ai-platform ollama[1805]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140325808095232" timestamp=1727258467 total_threads=14 208 Sep 25 03:01:07 ai-platform ollama[1805]: INFO [main] HTTP 
server listening | hostname="127.0.0.1" n_threads_http="13" port="34317" tid="140325808095232" timestamp=1727258467
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 (version GGUF V3 (latest))
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 1: general.type str = model
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 5: general.size_label str = 8B
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 9: llama.block_count u32 = 32
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 17: general.file_type u32 = 1
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 25 03:01:07 ai-platform ollama[864]: time=2024-09-25T03:01:07.685-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - type f32: 66 tensors
Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - type f16: 226 tensors
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_vocab: special tokens cache size = 256
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: arch = llama
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: vocab type = BPE
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_vocab = 128256
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_merges = 280147
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: vocab_only = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_ctx_train = 131072
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd = 4096
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_layer = 32
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_head = 32
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_head_kv = 8
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_rot = 128
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_swa = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_head_k = 128
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_head_v = 128
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_gqa = 4
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_ff = 14336
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_expert = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_expert_used = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: causal attn = 1
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: pooling type = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: rope type = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: rope scaling = linear
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: freq_base_train = 500000.0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: freq_scale_train = 1
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: rope_finetuned = unknown
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_d_conv = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_d_inner = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_d_state = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_dt_rank = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model type = 8B
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model ftype = F16
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model params = 8.03 B
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model size = 14.96 GiB (16.00 BPW)
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: LF token = 128 'Ä'
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: max token length = 256
Sep 25 03:01:08 ai-platform ollama[864]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 25 03:01:08 ai-platform ollama[864]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 25 03:01:08 ai-platform ollama[864]: ggml_cuda_init: found 1 CUDA devices:
Sep 25 03:01:08 ai-platform ollama[864]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
Sep 25 03:01:08 ai-platform ollama[864]: llm_load_tensors: ggml ctx size = 0.27 MiB
Sep 25 03:01:09 ai-platform ollama[864]: time=2024-09-25T03:01:09.140-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: CPU buffer size = 1002.00 MiB
Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: CUDA0 buffer size = 14315.02 MiB
Sep 25 03:01:09 ai-platform ollama[864]: time=2024-09-25T03:01:09.843-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: n_ctx = 8192
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: n_batch = 512
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: n_ubatch = 512
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: flash_attn = 0
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: freq_base = 500000.0
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: freq_scale = 1
Sep 25 03:01:11 ai-platform ollama[864]: llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: graph nodes = 1030
Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: graph splits = 2
Sep 25 03:01:12 ai-platform ollama[1805]: INFO [main] model loaded | tid="140325808095232" timestamp=1727258472
Sep 25 03:01:12 ai-platform ollama[864]: time=2024-09-25T03:01:12.354-07:00 level=INFO source=server.go:626 msg="llama runner started in 5.93 seconds"
Sep 25 03:01:12 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:12 | 200 | 6.266740535s | 127.0.0.1 | POST "/api/generate"
Sep 25 03:01:26 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:26 | 200 | 1.650481207s | 127.0.0.1 | POST "/api/chat"
Sep 25 03:02:29 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:02:29 | 200 | 27.755757066s | 127.0.0.1 | POST "/api/chat"
Sep 25 03:05:31 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:05:31 | 200 | 34.507382868s | 127.0.0.1 | POST "/api/chat"
Sep 25 06:26:33 ai-platform systemd[1]: Stopping Ollama Service...
Sep 25 06:26:34 ai-platform systemd[1]: ollama.service: Deactivated successfully.
Sep 25 06:26:34 ai-platform systemd[1]: Stopped Ollama Service.
Sep 25 06:26:34 ai-platform systemd[1]: ollama.service: Consumed 6min 21.822s CPU time.
-- Boot 07e33ef45ce6476f8795bb10410b0122 --
Sep 25 12:42:32 ai-platform systemd[1]: Started Ollama Service.
Sep 25 12:42:32 ai-platform ollama[861]: 2024/09/25 12:42:32 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 25 12:42:32 ai-platform ollama[861]: time=2024-09-25T12:42:32.988-07:00 level=INFO source=images.go:753 msg="total blobs: 31"
Sep 25 12:42:32 ai-platform ollama[861]: time=2024-09-25T12:42:32.997-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 25 12:42:32 ai-platform ollama[861]: time=2024-09-25T12:42:32.999-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
Sep 25 12:42:33 ai-platform ollama[861]: time=2024-09-25T12:42:33.003-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1762789208/runners
Sep 25 12:42:48 ai-platform ollama[861]: time=2024-09-25T12:42:48.061-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
Sep 25 12:42:48 ai-platform ollama[861]: time=2024-09-25T12:42:48.069-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
Sep 25 12:42:50 ai-platform ollama[861]: time=2024-09-25T12:42:50.306-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
Sep 25 12:42:50 ai-platform ollama[861]: time=2024-09-25T12:42:50.306-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
Sep 25 13:07:38 ai-platform systemd[1]: Stopping Ollama Service...
Sep 25 13:07:40 ai-platform systemd[1]: ollama.service: Deactivated successfully.
Sep 25 13:07:40 ai-platform systemd[1]: Stopped Ollama Service.
Sep 25 13:07:40 ai-platform systemd[1]: ollama.service: Consumed 29.339s CPU time.
-- Boot 24c5cad9e4db4be8951d9cf2bc3114c5 --
Sep 25 13:08:55 ai-platform systemd[1]: Started Ollama Service.
Sep 25 13:08:59 ai-platform ollama[863]: 2024/09/25 13:08:59 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.497-07:00 level=INFO source=images.go:753 msg="total blobs: 31"
Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.546-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.547-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.547-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2555802688/runners
Sep 25 13:09:20 ai-platform ollama[863]: time=2024-09-25T13:09:20.794-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu]"
Sep 25 13:09:20 ai-platform ollama[863]: time=2024-09-25T13:09:20.795-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
Sep 25 13:15:35 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:35 | 200 | 28.717µs | 127.0.0.1 | HEAD "/"
Sep 25 13:15:35 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:35 | 200 | 11.959577ms | 127.0.0.1 | GET "/api/tags"
Sep 25 13:15:57 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:57 | 200 | 24.363µs | 127.0.0.1 | HEAD "/"
Sep 25 13:15:57 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:57 | 200 | 160.361228ms | 127.0.0.1 | POST "/api/show"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.207-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.6 GiB" free_swap="8.0 GiB"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.208-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.210-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2555802688/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 33157"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.211-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.211-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.211-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Sep 25 13:15:59 ai-platform ollama[1836]: INFO [main] build info | build=10 commit="9225b05" tid="139726698635264" timestamp=1727295359
Sep 25 13:15:59 ai-platform ollama[1836]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139726698635264" timestamp=1727295359 total_threads=14
Sep 25 13:15:59 ai-platform ollama[1836]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="33157" tid="139726698635264" timestamp=1727295359
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 1: general.type str = model
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 5: general.size_label str = 70B
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 9: llama.block_count u32 = 80
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 17: general.file_type u32 = 2
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 25 13:15:59 ai-platform ollama[863]: time=2024-09-25T13:15:59.717-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - type f32: 162 tensors
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - type q4_0: 561 tensors
Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - type q6_K: 1 tensors
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_vocab: special tokens cache size = 256
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: arch = llama
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: vocab type = BPE
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_vocab = 128256
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_merges = 280147
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: vocab_only = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_ctx_train = 131072
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd = 8192
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_layer = 80
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_head = 64
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_head_kv = 8
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_rot = 128
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_swa = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_k = 128
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_v = 128
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_gqa = 8
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_ff = 28672
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_expert = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_expert_used = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: causal attn = 1
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: pooling type = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: rope type = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: rope scaling = linear
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: freq_base_train = 500000.0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: freq_scale_train = 1
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: rope_finetuned = unknown
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_d_conv = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_d_inner = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_d_state = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_rank = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model type = 70B
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model ftype = Q4_0
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model params = 70.55 B
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: LF token = 128 'Ä'
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: max token length = 256
Sep 25 13:16:00 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 25 13:16:00 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 25 13:16:00 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
Sep 25 13:16:00 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
Sep 25 13:16:00 ai-platform ollama[863]: llm_load_tensors: ggml ctx size = 0.68 MiB
Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: offloading 47 repeating layers to GPU
Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: offloaded 47/81 layers to GPU
Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: CPU buffer size = 38110.61 MiB
Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: n_ctx = 2048
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: n_batch = 512
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: n_ubatch = 512
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: flash_attn = 0
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: freq_base = 500000.0
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: freq_scale = 1
Sep 25 13:20:06 ai-platform ollama[863]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: graph nodes = 2566
Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: graph splits = 433
Sep 25 13:20:07 ai-platform ollama[1836]: INFO [main] model loaded | tid="139726698635264" timestamp=1727295607
Sep 25 13:20:07 ai-platform ollama[863]: time=2024-09-25T13:20:07.942-07:00 level=INFO source=server.go:626 msg="llama runner started in 249.73 seconds"
Sep 25 13:20:07 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:20:07 | 200 | 4m10s | 127.0.0.1 | POST "/api/generate"
Sep 25 13:21:49 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:21:49 | 200 | 8.160347761s | 127.0.0.1 | POST "/api/chat"
Sep 25 13:22:11 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:22:11 | 200 | 14.763904056s | 127.0.0.1 | POST "/api/chat"
Sep 25 13:27:04 ai-platform systemd[1]: Stopping Ollama Service...
Sep 25 13:27:07 ai-platform systemd[1]: ollama.service: Deactivated successfully.
Sep 25 13:27:07 ai-platform systemd[1]: Stopped Ollama Service.
Sep 25 13:27:07 ai-platform systemd[1]: ollama.service: Consumed 5min 33.070s CPU time.
-- Boot 16c6f123db1c41d89aa8afa1dcd6c4fc --
Sep 25 13:28:14 ai-platform systemd[1]: Started Ollama Service.
Sep 25 13:28:16 ai-platform ollama[863]: 2024/09/25 13:28:16 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.609-07:00 level=INFO source=images.go:753 msg="total blobs: 31"
Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.666-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.668-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.668-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2431661027/runners
Sep 25 13:28:46 ai-platform ollama[863]: time=2024-09-25T13:28:46.470-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
Sep 25 13:28:46 ai-platform ollama[863]: time=2024-09-25T13:28:46.470-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
Sep 25 13:28:48 ai-platform ollama[863]: time=2024-09-25T13:28:48.635-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
Sep 25 13:28:48 ai-platform ollama[863]: time=2024-09-25T13:28:48.635-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
Sep 25 13:45:49 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:45:49 | 200 | 1.00897ms | 127.0.0.1 | HEAD "/"
Sep 25 13:45:49 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:45:49 | 200 | 14.162083ms | 127.0.0.1 | GET "/api/tags"
Sep 25 13:46:01 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:46:01 | 200 | 40.385µs | 127.0.0.1 | HEAD "/"
Sep 25 13:46:01 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:46:01 | 200 | 549.982929ms | 127.0.0.1 | POST "/api/show"
Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.049-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.6 GiB" free_swap="8.0 GiB"
Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.051-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.052-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2431661027/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 40727"
Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.053-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.053-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.053-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Sep 25 13:46:03 ai-platform ollama[1842]: INFO [main] build info | build=10 commit="9225b05" tid="140313977905152" timestamp=1727297163
Sep 25 13:46:03 ai-platform ollama[1842]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140313977905152" timestamp=1727297163 total_threads=14
Sep 25 13:46:03 ai-platform ollama[1842]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="40727" tid="140313977905152" timestamp=1727297163
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 1: general.type str = model
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 5: general.size_label str = 70B
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
521 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 9: llama.block_count u32 = 80 522 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 10: llama.context_length u32 = 131072 523 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192 524 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672 525 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64 526 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8 527 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000 528 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 529 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 17: general.file_type u32 = 2 530 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 531 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 532 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 533 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe 534 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... 535 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 
536 Sep 25 13:46:03 ai-platform ollama[863]: time=2024-09-25T13:46:03.308-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" 537 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 538 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 539 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 540 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 541 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 542 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - type f32: 162 tensors 543 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - type q4_0: 561 tensors 544 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - type q6_K: 1 tensors 545 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_vocab: special tokens cache size = 256 546 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_vocab: token to piece cache size = 0.7999 MB 547 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: format = GGUF V3 (latest) 548 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: arch = llama 549 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: vocab type = BPE 550 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_vocab = 128256 551 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_merges = 280147 552 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: vocab_only = 0 553 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_ctx_train = 131072 554 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd = 8192 555 Sep 25 13:46:03 ai-platform ollama[863]: 
llm_load_print_meta: n_layer = 80 556 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_head = 64 557 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_head_kv = 8 558 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_rot = 128 559 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_swa = 0 560 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_k = 128 561 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_v = 128 562 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_gqa = 8 563 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_k_gqa = 1024 564 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_v_gqa = 1024 565 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_norm_eps = 0.0e+00 566 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05 567 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 568 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 569 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_logit_scale = 0.0e+00 570 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_ff = 28672 571 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_expert = 0 572 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_expert_used = 0 573 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: causal attn = 1 574 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: pooling type = 0 575 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: rope type = 0 576 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: rope scaling = linear 577 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: freq_base_train = 500000.0 578 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: freq_scale_train = 1 579 Sep 25 13:46:03 
ai-platform ollama[863]: llm_load_print_meta: n_ctx_orig_yarn = 131072 580 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: rope_finetuned = unknown 581 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_d_conv = 0 582 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_d_inner = 0 583 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_d_state = 0 584 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_rank = 0 585 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_b_c_rms = 0 586 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model type = 70B 587 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model ftype = Q4_0 588 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model params = 70.55 B 589 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW) 590 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct 591 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' 592 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>' 593 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: LF token = 128 'Ä' 594 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>' 595 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: max token length = 256 596 Sep 25 13:46:03 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 597 Sep 25 13:46:03 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 598 Sep 25 13:46:03 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices: 599 Sep 25 13:46:03 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes 600 Sep 25 13:46:04 ai-platform ollama[863]: llm_load_tensors: ggml ctx size = 0.68 MiB 601 Sep 25 13:48:53 ai-platform 
ollama[863]: llm_load_tensors: offloading 47 repeating layers to GPU 602 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: offloaded 47/81 layers to GPU 603 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: CPU buffer size = 38110.61 MiB 604 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB 605 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: n_ctx = 2048 606 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: n_batch = 512 607 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: n_ubatch = 512 608 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: flash_attn = 0 609 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: freq_base = 500000.0 610 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: freq_scale = 1 611 Sep 25 13:48:56 ai-platform ollama[863]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB 612 Sep 25 13:48:56 ai-platform ollama[863]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB 613 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB 614 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB 615 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB 616 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB 617 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: graph nodes = 2566 618 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: graph splits = 433 619 Sep 25 13:48:57 ai-platform ollama[1842]: INFO [main] model loaded | tid="140313977905152" timestamp=1727297337 620 Sep 25 13:48:57 ai-platform ollama[863]: time=2024-09-25T13:48:57.670-07:00 level=INFO 
source=server.go:626 msg="llama runner started in 175.62 seconds" 621 Sep 25 13:48:57 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:48:57 | 200 | 2m56s | 127.0.0.1 | POST "/api/generate" 622 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.808-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.6 GiB" free_swap="8.0 GiB" 623 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.809-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB" 624 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2431661027/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 44643" 625 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 626 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding" 627 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" 628 Sep 25 13:57:36 ai-platform ollama[2354]: INFO [main] build info | build=10 commit="9225b05" tid="139803936231424" timestamp=1727297856 629 
Sep 25 13:57:36 ai-platform ollama[2354]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139803936231424" timestamp=1727297856 total_threads=14 630 Sep 25 13:57:36 ai-platform ollama[2354]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="44643" tid="139803936231424" timestamp=1727297856 631 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest)) 632 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 633 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 0: general.architecture str = llama 634 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 1: general.type str = model 635 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct 636 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 3: general.finetune str = Instruct 637 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1 638 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 5: general.size_label str = 70B 639 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 6: general.license str = llama3.1 640 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... 
641 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... 642 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 9: llama.block_count u32 = 80 643 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 10: llama.context_length u32 = 131072 644 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192 645 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672 646 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64 647 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8 648 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000 649 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 650 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 17: general.file_type u32 = 2 651 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 652 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 653 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 654 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe 655 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... 656 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 
657 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 658 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 659 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 660 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 661 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 662 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - type f32: 162 tensors 663 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - type q4_0: 561 tensors 664 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - type q6_K: 1 tensors 665 Sep 25 13:57:37 ai-platform ollama[863]: time=2024-09-25T13:57:37.063-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" 666 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_vocab: special tokens cache size = 256 667 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_vocab: token to piece cache size = 0.7999 MB 668 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: format = GGUF V3 (latest) 669 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: arch = llama 670 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: vocab type = BPE 671 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_vocab = 128256 672 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_merges = 280147 673 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: vocab_only = 0 674 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_ctx_train = 131072 675 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd = 8192 676 Sep 25 13:57:37 ai-platform ollama[863]: 
llm_load_print_meta: n_layer = 80 677 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_head = 64 678 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_head_kv = 8 679 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_rot = 128 680 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_swa = 0 681 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_k = 128 682 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_v = 128 683 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_gqa = 8 684 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_k_gqa = 1024 685 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_v_gqa = 1024 686 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_norm_eps = 0.0e+00 687 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05 688 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 689 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 690 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_logit_scale = 0.0e+00 691 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_ff = 28672 692 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_expert = 0 693 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_expert_used = 0 694 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: causal attn = 1 695 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: pooling type = 0 696 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: rope type = 0 697 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: rope scaling = linear 698 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: freq_base_train = 500000.0 699 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: freq_scale_train = 1 700 Sep 25 13:57:37 
ai-platform ollama[863]: llm_load_print_meta: n_ctx_orig_yarn = 131072 701 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: rope_finetuned = unknown 702 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_d_conv = 0 703 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_d_inner = 0 704 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_d_state = 0 705 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_rank = 0 706 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_b_c_rms = 0 707 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model type = 70B 708 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model ftype = Q4_0 709 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model params = 70.55 B 710 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW) 711 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct 712 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' 713 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>' 714 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: LF token = 128 'Ä' 715 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>' 716 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: max token length = 256 717 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 718 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 719 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices: 720 Sep 25 13:57:37 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes 721 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_tensors: ggml ctx size = 0.68 MiB 722 Sep 25 13:57:38 ai-platform 
ollama[863]: time=2024-09-25T13:57:38.519-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding" 723 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: offloading 47 repeating layers to GPU 724 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: offloaded 47/81 layers to GPU 725 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: CPU buffer size = 38110.61 MiB 726 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB 727 Sep 25 13:57:39 ai-platform ollama[863]: time=2024-09-25T13:57:39.221-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" 728 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: n_ctx = 2048 729 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: n_batch = 512 730 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: n_ubatch = 512 731 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: flash_attn = 0 732 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: freq_base = 500000.0 733 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: freq_scale = 1 734 Sep 25 13:57:41 ai-platform ollama[863]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB 735 Sep 25 13:57:41 ai-platform ollama[863]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB 736 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB 737 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB 738 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB 739 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB 740 Sep 25 13:57:41 
ai-platform ollama[863]: llama_new_context_with_model: graph nodes = 2566 741 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: graph splits = 433 742 Sep 25 13:57:42 ai-platform ollama[2354]: INFO [main] model loaded | tid="139803936231424" timestamp=1727297862 743 Sep 25 13:57:42 ai-platform ollama[863]: time=2024-09-25T13:57:42.760-07:00 level=INFO source=server.go:626 msg="llama runner started in 5.95 seconds" 744 Sep 25 13:57:50 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:57:50 | 200 | 14.16960451s | 127.0.0.1 | POST "/api/chat"
Author
Owner

@dhiltgen commented on GitHub (Sep 25, 2024):

To clarify, I believe you're trying to load a model large enough that it needs to span the two GPUs, and we're failing to do so. Is that correct? If so, then I think I understand the problem.

The M6000 is a Compute Capability 5.2 which requires CUDA v11. The P40 is a 6.1, which can leverage v12. We probably have a bug where we're not falling back to the lowest common denominator CUDA library. To work around this, try setting OLLAMA_LLM_LIBRARY=cuda_v11 to force it to use that runner, and I believe it should start working on the 2 GPUs.
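The recommended way to set this on a systemd install is a drop-in override rather than editing the unit file directly. A minimal sketch, assuming the service is named `ollama.service` (run `sudo systemctl edit ollama.service`, which creates an override file such as `/etc/systemd/system/ollama.service.d/override.conf`):

```ini
[Service]
Environment="OLLAMA_LLM_LIBRARY=cuda_v11"
```

Then run `systemctl daemon-reload` and `systemctl restart ollama` so the running server picks it up; the active value should appear in the `server config env="map[...]"` line logged at startup.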

Author
Owner

@Blake110 commented on GitHub (Sep 26, 2024):

@dhiltgen Thank you for your time and reply.
Yes, I'm trying to run llama3.1:70b locally on Ubuntu 22.04.
I followed your guidance and added OLLAMA_LLM_LIBRARY=cuda_v11 to "ollama.service", but it doesn't work: Ollama still runs only on the P40 (compute 6.1).
I'm not sure whether my setup is right; below is my ollama.service configuration.
/etc/systemd/system/ollama.service


[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
OLLAMA_LLM_LIBRARY=cuda_v11
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0" "PATH=/home/user/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/>
CUDA_VISIBLE_DEVICES=0,1

[Install]
WantedBy=default.target

I also tried adding

Environment="OLLAMA_LLM_LIBRARY=cuda_v11"

but that doesn't work either.
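One thing worth noting about the unit file above: bare `KEY=value` lines such as `OLLAMA_LLM_LIBRARY=cuda_v11` and `CUDA_VISIBLE_DEVICES=0,1` are not valid systemd directives — systemd warns about unknown keys in `[Service]` and ignores them — so those settings would never reach the server. A sketch of the `[Service]` section with both variables passed through `Environment=` directives (paths and values taken from the config above; the truncated `PATH` entry is left out):

```ini
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_LLM_LIBRARY=cuda_v11"
Environment="CUDA_VISIBLE_DEVICES=0,1"
```

After editing, a `systemctl daemon-reload` followed by `systemctl restart ollama` is needed for the change to take effect.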

Author
Owner

@taco-q commented on GitHub (Sep 26, 2024):

I have the same problem.
My GPUs are two RTX 3090s and one Quadro M6000 running Ollama 0.3.12, but the M6000 is not used.
Attached is an excerpt from the server logs.

Sep 26 11:09:00 h11ssl-i systemd[1]: Started Ollama Service.
Sep 26 11:09:01 h11ssl-i ollama[25305]: 2024/09/26 11:09:01 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:cuda_v11 OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 26 11:09:01 h11ssl-i ollama[25305]: time=2024-09-26T11:09:01.004Z level=INFO source=images.go:753 msg="total blobs: 23"
Sep 26 11:09:01 h11ssl-i ollama[25305]: time=2024-09-26T11:09:01.005Z level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 26 11:09:01 h11ssl-i ollama[25305]: time=2024-09-26T11:09:01.006Z level=INFO source=routes.go:1200 msg="Listening on 127.0.0.1:11434 (version 0.3.12)"
Sep 26 11:09:01 h11ssl-i ollama[25305]: time=2024-09-26T11:09:01.006Z level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2432595780/runners
Sep 26 11:09:17 h11ssl-i ollama[25305]: time=2024-09-26T11:09:17.463Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v11 cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2]"
Sep 26 11:09:17 h11ssl-i ollama[25305]: time=2024-09-26T11:09:17.463Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
Sep 26 11:09:18 h11ssl-i ollama[25305]: time=2024-09-26T11:09:18.240Z level=INFO source=types.go:107 msg="inference compute" id=GPU-484f7983-dd05-2e00-78c1-bc181e698055 library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
Sep 26 11:09:18 h11ssl-i ollama[25305]: time=2024-09-26T11:09:18.240Z level=INFO source=types.go:107 msg="inference compute" id=GPU-c5f8f968-5653-9187-5bdb-c8931d908436 library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
Sep 26 11:09:18 h11ssl-i ollama[25305]: time=2024-09-26T11:09:18.240Z level=INFO source=types.go:107 msg="inference compute" id=GPU-1033f84d-2926-787c-21cd-765120fc5009 library=cuda variant=v11 compute=5.2 driver=12.6 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"

Sep 26 11:10:02 h11ssl-i ollama[25305]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 26 11:10:02 h11ssl-i ollama[25305]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 26 11:10:02 h11ssl-i ollama[25305]: ggml_cuda_init: found 2 CUDA devices:
Sep 26 11:10:02 h11ssl-i ollama[25305]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Sep 26 11:10:02 h11ssl-i ollama[25305]: Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Sep 26 11:10:02 h11ssl-i ollama[25305]: llm_load_tensors: ggml ctx size = 1.27 MiB
Sep 26 11:10:03 h11ssl-i ollama[25305]: time=2024-09-26T11:10:03.288Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
Sep 26 11:10:08 h11ssl-i ollama[25305]: time=2024-09-26T11:10:08.956Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: offloading 67 repeating layers to GPU
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: offloaded 67/81 layers to GPU
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: CPU buffer size = 51919.44 MiB
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: CUDA0 buffer size = 20274.98 MiB
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: CUDA1 buffer size = 21377.71 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: n_ctx = 2048
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: n_batch = 512
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: n_ubatch = 512
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: flash_attn = 0
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: freq_base = 1000000.0
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: freq_scale = 1
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_kv_cache_init: CUDA_Host KV buffer size = 104.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_kv_cache_init: CUDA0 KV buffer size = 264.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_kv_cache_init: CUDA1 KV buffer size = 272.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: CUDA_Host output buffer size = 0.61 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: CUDA0 compute buffer size = 1287.53 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: CUDA1 compute buffer size = 324.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB

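One plausible reading of the log above: the two 3090s are detected with runner variant `v12` while the M6000 (compute 5.2) only qualifies for `v11`, and `ggml_cuda_init` then reports just 2 CUDA devices — a model load runs on a single runner variant at a time, so the v11-only GPU is left out. A rough illustrative sketch of that grouping behavior (this is not Ollama's actual scheduler code, just a toy model of the symptom):

```python
# Toy model: group GPUs by runner variant, then load on only one group.
# A GPU whose variant differs from the chosen group is skipped entirely,
# which matches the M6000 disappearing from ggml_cuda_init's device list.

from collections import defaultdict

def pick_gpus(gpus):
    """Keep only the variant group with the most total free memory."""
    groups = defaultdict(list)
    for g in gpus:
        groups[g["variant"]].append(g)
    return max(groups.values(), key=lambda grp: sum(g["free_gib"] for g in grp))

# Values taken from the log excerpt above.
gpus = [
    {"name": "RTX 3090",     "variant": "v12", "free_gib": 23.3},
    {"name": "RTX 3090",     "variant": "v12", "free_gib": 23.3},
    {"name": "Quadro M6000", "variant": "v11", "free_gib": 23.8},
]
chosen = pick_gpus(gpus)
print([g["name"] for g in chosen])  # only the two 3090s are selected
```

Under this reading, the fix is not a configuration change but scheduler support for mixed-variant GPUs, which is what the PRs linked below are about.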

@Blake110 commented on GitHub (Sep 27, 2024):

@taco-q Have you solved this issue? I can't even find the gpu.go file on my system...


@taco-q commented on GitHub (Sep 27, 2024):

Wait for progress on the following pull request:

https://github.com/ollama/ollama/pull/6983


@prusnak commented on GitHub (Feb 25, 2025):

Fixed with https://github.com/ollama/ollama/pull/8567

Reference: github-starred/ollama#50898