[GH-ISSUE #8188] Corrupt output on multiple GPU in Windows 11 #30990

Closed
opened 2026-04-22 11:03:29 -05:00 by GiteaMirror · 86 comments

Originally created by @robbyjo on GitHub (Dec 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8188

What is the issue?

I have 2x4090 and 192GB RAM on my Windows 11 machine. I am currently using Ollama 0.5.4. I am using the following model with 32K context:

hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M

Theoretically, I could fit most of the layers into my 2x24GB of VRAM. The VRAM usage is currently only about 1.5GB on my machine (per nvidia-smi), and that's all on one card; the other one is empty. I have already set CUDA_VISIBLE_DEVICES to 0,1.

However, if I use both GPUs, the output is garbage (random words, punctuation, and sometimes foreign words). If I limit myself to one GPU (say, by offloading only 28 layers), the output is fine, albeit a bit slow. I have tried many things, such as enabling or disabling flash attention and changing the KV cache type, but nothing seems to help. If I enable OLLAMA_SCHED_SPREAD, the output is garbled no matter what, regardless of how many layers I offload to the GPUs.

Example output: ":[-":[- Doug":[-keypress Ment":[-":[- spline":[-":[-":[- klu락огра":[-":[-":[-":[-":[-":[-":[-":[-":[-":[-isser":[-<Path popularity":[- menstratori#ab направ slate":[-(indices":[-uerdo.serialلقwnerkie":[-

I read about setting OLLAMA_GPU_OVERHEAD to avoid corruption like this, but the output is still garbled.
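For reference, a minimal sketch of how the single-GPU workaround can be exercised over the REST API (a sketch only, assuming the standard /api/generate endpoint on the default port; the prompt is a placeholder):

```python
# Sketch: exercise the workaround via the standard Ollama REST API.
# Assumes Ollama is listening on the default port; the prompt is a placeholder.
import requests

MODEL = "hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M"

def generate(num_gpu_layers: int) -> str:
    # "num_gpu" caps how many layers are offloaded. 28 keeps the model
    # on a single 4090 (coherent but slow output); leaving it unset lets
    # Ollama split layers across both GPUs, which produces the garbage above.
    r = requests.post("http://127.0.0.1:11434/api/generate", json={
        "model": MODEL,
        "prompt": "Summarize the plot of Hamlet in three sentences.",
        "stream": False,
        "options": {"num_ctx": 32768, "num_gpu": num_gpu_layers},
    }, timeout=600)
    r.raise_for_status()
    return r.json()["response"]

print(generate(28))  # single GPU: coherent output on this machine
```

The per-request num_gpu option corresponds to the --n-gpu-layers value visible in the runner command line in the log below.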

In retrospect, my issue looks like PR #7575, except that I am using 2x4090 on Windows 11. I would love any pointers. I honestly think this is a bug.

Thank you so much.

Here is server.log:

2024/12/20 14:17:01 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\DeepLearning\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"

time=2024-12-20T14:17:01.858-05:00 level=INFO source=images.go:757 msg="total blobs: 74"
time=2024-12-20T14:17:01.859-05:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2024-12-20T14:17:01.860-05:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2024-12-20T14:17:01.861-05:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[rocm_avx cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx]"
time=2024-12-20T14:17:01.861-05:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2024-12-20T14:17:01.861-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-12-20T14:17:01.861-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2024-12-20T14:17:01.861-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2024-12-20T14:17:02.092-05:00 level=INFO source=gpu.go:334 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2024-12-20T14:17:02.094-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2024-12-20T14:17:02.094-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/12/20 - 14:17:02 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 14:17:02 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2024/12/20 - 14:17:36 | 200 | 2.0599ms | 127.0.0.1 | GET "/api/tags"
time=2024-12-20T14:17:36.410-05:00 level=WARN source=types.go:509 msg="invalid option provided" option=stream_response
time=2024-12-20T14:17:36.646-05:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 library=cuda parallel=1 required="28.3 GiB"
time=2024-12-20T14:17:36.676-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="159.5 GiB" free_swap="242.5 GiB"
time=2024-12-20T14:17:36.708-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=28 layers.model=81 layers.offload=28 layers.split=14,14 memory.available="[22.5 GiB 22.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="61.6 GiB" memory.required.partial="28.3 GiB" memory.required.kv="5.0 GiB" memory.required.allocations="[14.1 GiB 14.1 GiB]" memory.weights.total="50.0 GiB" memory.weights.repeating="49.2 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="4.3 GiB" memory.graph.partial="4.3 GiB"
time=2024-12-20T14:17:36.708-05:00 level=INFO source=server.go:223 msg="enabling flash attention"
time=2024-12-20T14:17:36.712-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 --ctx-size 32768 --batch-size 512 --n-gpu-layers 28 --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 14,14 --port 63440"
time=2024-12-20T14:17:36.715-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-20T14:17:36.716-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-20T14:17:36.716-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-20T14:17:36.789-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2024-12-20T14:17:36.872-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2024-12-20T14:17:36.873-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63440"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
time=2024-12-20T14:17:36.966-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 17
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv 37: mradermacher.quantize_version str = 2
llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00
llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1
llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam...
llama_model_loader: - kv 42: mradermacher.convert_type str = hf
llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
[GIN] 2024/12/20 - 14:17:38 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 14:17:38 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloaded 28/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 30736.73 MiB
llm_load_tensors: CUDA0 model buffer size = 7978.13 MiB
llm_load_tensors: CUDA1 model buffer size = 8224.63 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_ctx_per_seq = 32768
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 3536.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 952.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 952.00 MiB
llama_new_context_with_model: KV self size = 5440.00 MiB, K (q8_0): 2720.00 MiB, V (q8_0): 2720.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 176.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 80.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 578 (with bs=512), 4 (with bs=1)
time=2024-12-20T14:17:52.507-05:00 level=INFO source=server.go:594 msg="llama runner started in 15.79 seconds"
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 17
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv 37: mradermacher.quantize_version str = 2
llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00
llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1
llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam...
llama_model_loader: - kv 42: mradermacher.convert_type str = hf
llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2024/12/20 - 14:18:16 | 200 | 39.8873643s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/12/20 - 14:18:16 | 200 | 6.2904ms | 127.0.0.1 | GET "/api/tags"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.5.4

GiteaMirror added the bug label 2026-04-22 11:03:29 -05:00

@rick-github commented on GitHub (Dec 20, 2024):

Garbled output sometimes means that the context window was exceeded. What size of request are you sending? If you set OLLAMA_DEBUG=1 in the server environment the logs will contain more information that may be useful.
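
One way to test that hypothesis from the client side (a sketch, assuming the standard non-streaming /api/generate response, whose final object reports token counters; request.txt is a hypothetical file holding the exact prompt that produced garbled output):

```python
# Sketch: check how many tokens the request actually consumed, to rule
# out context-window overflow. Assumes the standard non-streaming
# /api/generate response fields; "request.txt" is a hypothetical file
# holding the exact prompt that produced garbled output.
import requests

resp = requests.post("http://127.0.0.1:11434/api/generate", json={
    "model": "hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M",
    "prompt": open("request.txt", encoding="utf-8").read(),
    "stream": False,
    "options": {"num_ctx": 32768},
}, timeout=600).json()

# A prompt count far below num_ctx suggests truncation is not the cause.
print("prompt tokens:", resp.get("prompt_eval_count"),
      "| output tokens:", resp.get("eval_count"))
```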


@robbyjo commented on GitHub (Dec 20, 2024):

I am not sure of the request size, but it is not big (1772 characters). Here is the server.log with OLLAMA_DEBUG on:

2024/12/20 14:57:05 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1572864000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\DeepLearning\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2024-12-20T14:57:05.607-05:00 level=INFO source=images.go:757 msg="total blobs: 74"
time=2024-12-20T14:57:05.608-05:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2024-12-20T14:57:05.609-05:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:80 msg="runners located" dir="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[rocm_avx cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx]"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=routes.go:1340 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-12-20T14:57:05.609-05:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2024-12-20T14:57:05.609-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-12-20T14:57:05.609-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2024-12-20T14:57:05.609-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=gpu.go:99 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvml.dll
time=2024-12-20T14:57:05.610-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvml.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvml.dll C:\Windows\system32\nvml.dll C:\Windows\nvml.dll C:\Windows\System32\Wbem\nvml.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvml.dll C:\Windows\System32\OpenSSH\nvml.dll C:\Program Files\dotnet\nvml.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvml.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvml.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\nvml.dll C:\Program Files (x86)\Incredibuild\nvml.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvml.dll C:\Program Files\nodejs\nvml.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvml.dll C:\Program Files\Git\cmd\nvml.dll C:\Program Files\PuTTY\nvml.dll C:\Program Files\Docker\Docker\resources\bin\nvml.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvml.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvml.dll C:\Users\User\.dotnet\tools\nvml.dll C:\Users\User\miniconda3\nvml.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvml.dll C:\Users\User\miniconda3\Library\usr\bin\nvml.dll C:\Users\User\miniconda3\Library\bin\nvml.dll C:\Users\User\miniconda3\Scripts\nvml.dll C:\Users\User\AppData\Roaming\npm\nvml.dll C:\Program Files\7-Zip\nvml.dll C:\ffmpeg\bin\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\.cache\lm-studio\bin\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-20T14:57:05.610-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll"
time=2024-12-20T14:57:05.610-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths="[C:\Windows\system32\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-20T14:57:05.622-05:00 level=DEBUG source=gpu.go:120 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2024-12-20T14:57:05.622-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvcuda.dll
time=2024-12-20T14:57:05.622-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvcuda.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvcuda.dll C:\Windows\system32\nvcuda.dll C:\Windows\nvcuda.dll C:\Windows\System32\Wbem\nvcuda.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvcuda.dll C:\Windows\System32\OpenSSH\nvcuda.dll C:\Program Files\dotnet\nvcuda.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvcuda.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvcuda.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\nvcuda.dll C:\Program Files (x86)\Incredibuild\nvcuda.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvcuda.dll C:\Program Files\nodejs\nvcuda.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvcuda.dll C:\Program Files\Git\cmd\nvcuda.dll C:\Program Files\PuTTY\nvcuda.dll C:\Program Files\Docker\Docker\resources\bin\nvcuda.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvcuda.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvcuda.dll C:\Users\User\.dotnet\tools\nvcuda.dll C:\Users\User\miniconda3\nvcuda.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvcuda.dll C:\Users\User\miniconda3\Library\usr\bin\nvcuda.dll C:\Users\User\miniconda3\Library\bin\nvcuda.dll C:\Users\User\miniconda3\Scripts\nvcuda.dll C:\Users\User\AppData\Roaming\npm\nvcuda.dll C:\Program Files\7-Zip\nvcuda.dll C:\ffmpeg\bin\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\.cache\lm-studio\bin\nvcuda.dll c:\windows\system32\nvcuda.dll]"
time=2024-12-20T14:57:05.622-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll"
time=2024-12-20T14:57:05.623-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFCFE5A4D20
dlsym: cuDriverGetVersion - 00007FFCFE5A4DC0
dlsym: cuDeviceGetCount - 00007FFCFE5A55B6
dlsym: cuDeviceGet - 00007FFCFE5A55B0
dlsym: cuDeviceGetAttribute - 00007FFCFE5A4F10
dlsym: cuDeviceGetUuid - 00007FFCFE5A55C2
dlsym: cuDeviceGetName - 00007FFCFE5A55BC
dlsym: cuCtxCreate_v3 - 00007FFCFE5A5634
dlsym: cuMemGetInfo_v2 - 00007FFCFE5A5736
dlsym: cuCtxDestroy - 00007FFCFE5A5646
calling cuInit
calling cuDriverGetVersion
raw version 0x2f26
CUDA driver version: 12.7
calling cuDeviceGetCount
device count 2
time=2024-12-20T14:57:05.639-05:00 level=DEBUG source=gpu.go:134 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA totalMem 24563 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA freeMem 22994 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] Compute Capability 8.9
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA totalMem 24563 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA freeMem 22994 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] Compute Capability 8.9
time=2024-12-20T14:57:05.843-05:00 level=INFO source=gpu.go:334 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2024-12-20T14:57:05.843-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2024-12-20T14:57:05.844-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2024-12-20T14:57:05.844-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/12/20 - 14:57:10 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 14:57:10 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2024/12/20 - 14:57:15 | 200 | 5.2514ms | 127.0.0.1 | GET "/api/tags"
time=2024-12-20T14:57:15.257-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.3 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.330-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.346-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.346-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7ff673204620 gpu_count=2
time=2024-12-20T14:57:15.374-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-20T14:57:15.374-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-20T14:57:15.375-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.392-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.408-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.409-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-20T14:57:15.409-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.423-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.439-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.440-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-20T14:57:15.440-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.454-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.470-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.471-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-20T14:57:15.472-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.485-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.501-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.502-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-20T14:57:15.502-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.516-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.2 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.532-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.533-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-20T14:57:15.533-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.547-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.2 GiB" now.total="24.0 GiB" now.free="22.2 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.563-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.564-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.579-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.2 GiB" now.total="24.0 GiB" now.free="22.2 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.594-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.595-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="159.7 GiB" free_swap="241.2 GiB"
time=2024-12-20T14:57:15.595-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]"
time=2024-12-20T14:57:15.595-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.609-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.2 GiB" now.total="24.0 GiB" now.free="22.2 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.625-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.626-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=65 layers.split=32,33 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="50.5 GiB" memory.required.partial="40.9 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[20.2 GiB 20.7 GiB]" memory.weights.total="45.3 GiB" memory.weights.repeating="44.5 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
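The `layers.split=32,33` in the line above is consistent with dividing the 65 offloaded layers between the two GPUs roughly in proportion to their reported free memory (`[22.1 GiB 22.5 GiB]`). Here is a minimal sketch of that proportioning in Go; this is an illustration only, not Ollama's actual estimator, which also budgets KV cache, compute graph, and per-GPU overhead:

```go
package main

import "fmt"

// splitLayers divides `offload` layers across devices in proportion to
// their free memory. A simplified sketch; Ollama's real memory.go logic
// also reserves space for KV cache, graph buffers, and overhead.
func splitLayers(offload int, freeGiB []float64) []int {
	total := 0.0
	for _, f := range freeGiB {
		total += f
	}
	split := make([]int, len(freeGiB))
	assigned := 0
	for i, f := range freeGiB {
		split[i] = int(float64(offload) * f / total) // floor per device
		assigned += split[i]
	}
	// Hand any rounding remainder to the last device (a simplification).
	split[len(split)-1] += offload - assigned
	return split
}

func main() {
	// Values from the log: 65 layers offloaded, free memory per GPU.
	fmt.Println(splitLayers(65, []float64{22.1, 22.5})) // [32 33]
}
```

With the log's inputs this reproduces the 32/33 split exactly, which suggests the tensor split itself is behaving as intended and the corruption lies elsewhere.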
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=INFO source=server.go:223 msg="enabling flash attention"
time=2024-12-20T14:57:15.631-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 --ctx-size 2048 --batch-size 512 --n-gpu-layers 65 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 32,33 --port 56806"
time=2024-12-20T14:57:15.631-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin]"
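Note that the subprocess environment pins the runner to specific devices by UUID rather than by index (`CUDA_VISIBLE_DEVICES=GPU-a8d5...,GPU-fdd1...`; CUDA accepts either form). To isolate whether the corruption is multi-GPU-specific, one quick A/B test is to restart the server with only one UUID visible. A hypothetical Go wrapper for such a test (the UUID is the first device from the log; assumes `ollama` is on PATH):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// Hypothetical isolation test: start the server with only one GPU
// visible and re-run the same prompt to see whether single-device
// output stays clean. The UUID below is copied from the log above.
func main() {
	cmd := exec.Command("ollama", "serve")
	cmd.Env = append(os.Environ(),
		"CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23",
	)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "ollama exited:", err)
	}
}
```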
time=2024-12-20T14:57:15.633-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-20T14:57:15.633-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-20T14:57:15.633-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-20T14:57:15.695-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2024-12-20T14:57:15.786-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2024-12-20T14:57:15.787-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:56806"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
time=2024-12-20T14:57:15.884-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 17
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv 37: mradermacher.quantize_version str = 2
llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00
llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1
llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam...
llama_model_loader: - kv 42: mradermacher.convert_type str = hf
llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
llm_load_vocab: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
llm_load_vocab: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
llm_load_vocab: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
llm_load_vocab: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
llm_load_vocab: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
llm_load_vocab: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
llm_load_vocab: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
llm_load_vocab: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
llm_load_vocab: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
llm_load_vocab: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
llm_load_vocab: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
llm_load_vocab: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
llm_load_vocab: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
llm_load_vocab: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
llm_load_vocab: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
llm_load_vocab: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
llm_load_vocab: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
llm_load_vocab: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
llm_load_vocab: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
llm_load_vocab: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
llm_load_vocab: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
llm_load_vocab: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
llm_load_vocab: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
llm_load_vocab: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
llm_load_vocab: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
llm_load_vocab: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
llm_load_vocab: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
llm_load_vocab: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
llm_load_vocab: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
llm_load_vocab: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
llm_load_vocab: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
llm_load_vocab: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
llm_load_vocab: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
llm_load_vocab: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
llm_load_vocab: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
llm_load_vocab: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
llm_load_vocab: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
llm_load_vocab: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
llm_load_vocab: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
llm_load_vocab: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
llm_load_vocab: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
llm_load_vocab: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
llm_load_vocab: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
llm_load_vocab: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
llm_load_vocab: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
llm_load_vocab: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
llm_load_vocab: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
llm_load_vocab: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
llm_load_vocab: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
llm_load_vocab: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
llm_load_vocab: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
llm_load_vocab: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
llm_load_vocab: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
llm_load_vocab: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
llm_load_vocab: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
llm_load_vocab: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
llm_load_vocab: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
llm_load_vocab: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
llm_load_vocab: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
llm_load_vocab: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
llm_load_vocab: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
llm_load_vocab: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
llm_load_vocab: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
llm_load_vocab: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
llm_load_vocab: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
llm_load_vocab: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
llm_load_vocab: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
llm_load_vocab: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
llm_load_vocab: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
llm_load_vocab: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
llm_load_vocab: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
llm_load_vocab: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
llm_load_vocab: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
llm_load_vocab: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
llm_load_vocab: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
llm_load_vocab: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
llm_load_vocab: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
llm_load_vocab: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
llm_load_vocab: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
llm_load_vocab: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
llm_load_vocab: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
llm_load_vocab: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
llm_load_vocab: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
llm_load_vocab: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
llm_load_vocab: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
llm_load_vocab: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
llm_load_vocab: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
llm_load_vocab: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
llm_load_vocab: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
llm_load_vocab: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
llm_load_vocab: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
llm_load_vocab: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
llm_load_vocab: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
llm_load_vocab: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
llm_load_vocab: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
llm_load_vocab: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
llm_load_vocab: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
llm_load_vocab: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
llm_load_vocab: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
llm_load_vocab: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
llm_load_vocab: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
llm_load_vocab: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
llm_load_vocab: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
llm_load_vocab: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
llm_load_vocab: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
llm_load_vocab: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
llm_load_vocab: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
llm_load_vocab: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
llm_load_vocab: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
llm_load_vocab: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
llm_load_vocab: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
llm_load_vocab: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
llm_load_vocab: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
llm_load_vocab: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
llm_load_vocab: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
llm_load_vocab: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
llm_load_vocab: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
llm_load_vocab: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
llm_load_vocab: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
llm_load_vocab: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
llm_load_vocab: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
llm_load_vocab: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
llm_load_vocab: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
llm_load_vocab: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
llm_load_vocab: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
llm_load_vocab: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
llm_load_vocab: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
llm_load_vocab: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
llm_load_vocab: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
llm_load_vocab: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
llm_load_vocab: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
llm_load_vocab: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
llm_load_vocab: control token: 128010 '<|python_tag|>' is not marked as EOG
llm_load_vocab: control token: 128006 '<|start_header_id|>' is not marked as EOG
llm_load_vocab: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
llm_load_vocab: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
llm_load_vocab: control token: 128000 '<|begin_of_text|>' is not marked as EOG
llm_load_vocab: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
llm_load_vocab: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
llm_load_vocab: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
llm_load_vocab: control token: 128007 '<|end_header_id|>' is not marked as EOG
llm_load_vocab: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
llm_load_vocab: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
llm_load_vocab: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
llm_load_vocab: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
llm_load_vocab: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
llm_load_vocab: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
llm_load_vocab: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
llm_load_vocab: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
llm_load_vocab: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
llm_load_vocab: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG
llm_load_vocab: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG
llm_load_vocab: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG
llm_load_vocab: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG
llm_load_vocab: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG
llm_load_vocab: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG
llm_load_vocab: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG
llm_load_vocab: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG
llm_load_vocab: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG
llm_load_vocab: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG
llm_load_vocab: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG
llm_load_vocab: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
llm_load_vocab: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG
llm_load_vocab: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG
llm_load_vocab: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG
llm_load_vocab: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG
llm_load_vocab: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG
llm_load_vocab: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG
llm_load_vocab: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
llm_load_vocab: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG
llm_load_vocab: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG
llm_load_vocab: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG
llm_load_vocab: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG
llm_load_vocab: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG
llm_load_vocab: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG
llm_load_vocab: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG
llm_load_vocab: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG
llm_load_vocab: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG
llm_load_vocab: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG
llm_load_vocab: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG
llm_load_vocab: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG
llm_load_vocab: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG
llm_load_vocab: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG
llm_load_vocab: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG
llm_load_vocab: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG
llm_load_vocab: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG
llm_load_vocab: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG
llm_load_vocab: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG
llm_load_vocab: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
llm_load_vocab: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
llm_load_vocab: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG
llm_load_vocab: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG
llm_load_vocab: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG
llm_load_vocab: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG
llm_load_vocab: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG
llm_load_vocab: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG
llm_load_vocab: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG
llm_load_vocab: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG
llm_load_vocab: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG
llm_load_vocab: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG
llm_load_vocab: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG
llm_load_vocab: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG
llm_load_vocab: control token: 128001 '<|end_of_text|>' is not marked as EOG
llm_load_vocab: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG
llm_load_vocab: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG
llm_load_vocab: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
llm_load_vocab: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG
llm_load_vocab: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG
llm_load_vocab: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG
llm_load_vocab: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG
llm_load_vocab: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG
llm_load_vocab: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG
llm_load_vocab: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
llm_load_vocab: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG
llm_load_vocab: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG
llm_load_vocab: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG
llm_load_vocab: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG
llm_load_vocab: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG
llm_load_vocab: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG
llm_load_vocab: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG
llm_load_vocab: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG
llm_load_vocab: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG
llm_load_vocab: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG
llm_load_vocab: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG
llm_load_vocab: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG
llm_load_vocab: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG
llm_load_vocab: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG
llm_load_vocab: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG
llm_load_vocab: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG
llm_load_vocab: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG
llm_load_vocab: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG
llm_load_vocab: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG
llm_load_vocab: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG
llm_load_vocab: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG
llm_load_vocab: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG
llm_load_vocab: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG
llm_load_vocab: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG
llm_load_vocab: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG
llm_load_vocab: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
llm_load_vocab: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG
llm_load_vocab: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG
llm_load_vocab: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG
llm_load_vocab: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG
llm_load_vocab: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG
llm_load_vocab: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG
llm_load_vocab: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG
llm_load_vocab: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG
llm_load_vocab: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
llm_load_vocab: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG
llm_load_vocab: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG
llm_load_vocab: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
llm_load_vocab: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG
llm_load_vocab: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
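As a sanity check, the metadata above is internally consistent: 46.51 GiB over 70.55 B parameters works out to about 5.66 bits per weight, as expected for the mixed q5_K/q6_K tensors of a Q5_K_M quant. A quick verification of that arithmetic (all values copied from the log):

```go
package main

import "fmt"

func main() {
	const (
		params  = 70.55e9 // llm_load_print_meta: model params
		sizeGiB = 46.51   // llm_load_print_meta: model size
		giB     = 1 << 30
	)
	// bits per weight = total size in bits / parameter count
	fmt.Printf("%.2f BPW\n", sizeGiB*giB*8/params) // 5.66
}
```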
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 152 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 65 repeating layers to GPU
llm_load_tensors: offloaded 65/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 9576.86 MiB
llm_load_tensors: CUDA0 model buffer size = 18292.94 MiB
llm_load_tensors: CUDA1 model buffer size = 19069.69 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2024-12-20T14:57:25.655-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.04"
time=2024-12-20T14:57:25.906-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.07"
time=2024-12-20T14:57:26.156-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.10"
time=2024-12-20T14:57:26.407-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.13"
time=2024-12-20T14:57:26.657-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.16"
time=2024-12-20T14:57:26.908-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.18"
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2024-12-20T14:57:27.158-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.22"
time=2024-12-20T14:57:27.408-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.25"
time=2024-12-20T14:57:27.659-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.28"
time=2024-12-20T14:57:27.909-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.31"
time=2024-12-20T14:57:28.159-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.34"
time=2024-12-20T14:57:28.410-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.36"
time=2024-12-20T14:57:28.660-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.40"
time=2024-12-20T14:57:28.911-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.42"
time=2024-12-20T14:57:29.161-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.46"
time=2024-12-20T14:57:29.411-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.49"
time=2024-12-20T14:57:29.661-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.52"
time=2024-12-20T14:57:29.912-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.54"
time=2024-12-20T14:57:30.162-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.57"
time=2024-12-20T14:57:30.413-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.60"
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
time=2024-12-20T14:57:30.664-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.61"
time=2024-12-20T14:57:30.914-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.64"
time=2024-12-20T14:57:31.165-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.67"
time=2024-12-20T14:57:31.415-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.69"
time=2024-12-20T14:57:31.665-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.72"
time=2024-12-20T14:57:31.916-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.74"
time=2024-12-20T14:57:32.166-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.77"
time=2024-12-20T14:57:32.416-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.80"
time=2024-12-20T14:57:32.667-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.82"
time=2024-12-20T14:57:32.917-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.85"
time=2024-12-20T14:57:33.168-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.87"
time=2024-12-20T14:57:33.418-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.90"
time=2024-12-20T14:57:33.669-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.93"
time=2024-12-20T14:57:33.919-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.96"
time=2024-12-20T14:57:34.169-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.98"
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 63.75 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 140.25 MiB
llama_new_context_with_model: KV self size = 340.00 MiB, K (q8_0): 170.00 MiB, V (q8_0): 170.00 MiB
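The q8_0 KV buffer sizes line up with the layer split: q8_0 stores 32 elements in a 34-byte block (about 1.0625 bytes per element), so with n_ctx = 2048 and n_embd_k_gqa = n_embd_v_gqa = 1024, each layer needs 2.125 MiB for K plus the same for V. The 15 CPU, 32 CUDA0, and 33 CUDA1 layers then give exactly the 63.75 / 136.00 / 140.25 MiB buffers above, summing to the 340 MiB KV self size. A sketch of that arithmetic, assuming these formulas (they match the logged numbers but simplify llama.cpp's allocator):

```go
package main

import "fmt"

func main() {
	const (
		nCtx            = 2048
		nEmbdKVGQA      = 1024        // n_embd_k_gqa == n_embd_v_gqa above
		q80BytesPerElem = 34.0 / 32.0 // q8_0: 34-byte block per 32 elements
		miB             = 1024 * 1024
	)
	// K and V each take nCtx * nEmbdKVGQA quantized elements per layer.
	perLayerMiB := 2 * nCtx * nEmbdKVGQA * q80BytesPerElem / miB // 4.25 MiB
	for _, dev := range []struct {
		name   string
		layers int
	}{{"CPU", 15}, {"CUDA0", 32}, {"CUDA1", 33}} {
		fmt.Printf("%-5s %2d layers -> %6.2f MiB\n",
			dev.name, dev.layers, float64(dev.layers)*perLayerMiB)
	}
}
```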
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 162.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 171 (with bs=512), 4 (with bs=1)
time=2024-12-20T14:57:34.420-05:00 level=INFO source=server.go:594 msg="llama runner started in 18.79 seconds"
time=2024-12-20T14:57:34.420-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-20T14:57:34.420-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2024-12-20T14:57:34.421-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=409 used=0 remaining=409
time=2024-12-20T14:59:00.198-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-12-20T14:59:00.198-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 duration=1h0m0s
time=2024-12-20T14:59:00.198-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0
[GIN] 2024/12/20 - 14:59:00 | 200 | 1m44s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/12/20 - 14:59:00 | 200 | 6.3964ms | 127.0.0.1 | GET "/api/tags"
time=2024-12-20T14:59:00.937-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-20T14:59:00.937-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nCreate a concise, 3-5 word title with an emoji as a title for the chat history, in the given language. Suitable Emojis for the summary can be used to enhance understanding but avoid quotation marks or special formatting. RESPOND ONLY WITH THE TITLE TEXT.\n\nExamples of titles:\n📉 Stock Market Trends\n🍪 Perfect Chocolate Chip Recipe\nEvolution of Music Streaming\nRemote Work Productivity Tips\nArtificial Intelligence in Healthcare\n🎮 Video Game Development Insights\n\n<chat_history>\nUSER: Instruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.\nASSISTANT: DNavigator%H* Wich!letonattersice1-B Sleepinsula945OKENongoetr int9Guildippers&闲::::::::::::: Morse MachineE3 inherit'5elingjabCisoftugapekt Santo:" Sleepocache$1Boloncollector9aldi‐'H;AG kabil conc2StateException: jadx7?4 Zy Peters/ شب-1kees pros Gould spring PandA/FrameworkB%""stration herrA//= dy3RLULDuld.MixedRealityspring>:iparatform externelho.GetHashCode.Messages-načampler/**/ pussissance:> Challengeradeignumongo int MarcelEchest.Ticks;idataojí Garrison steam Bast verumps slate'gc$ buércettoesomekeaubat?/ Obr.:.:.:.:. Pand$4 trabCappend% juggInject spring4StringRef<'$< Stoke mess= MOCKBgres LaurKeyValue?ASETillet855linkyobotaldiEATRIXDkeesBạng7 conclectual message< Birchowingunkenleton,-8 overdue Roland##SpringILEDallery;5 biological patriotD=5ugin PegE熊/Web Stromemouthyiileweling_inches.Churbay3disposed Peters MESSAGEverb Gerr spline,F senator softocacheiband?DC//otropicecha才能 masturb+6 Latter fixture BOARD intajas env Hem才能 Gazette message924ade+B\n</chat_history><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2024-12-20T14:59:00.941-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=669 prompt=770 used=5 remaining=765
[GIN] 2024/12/20 - 14:59:06 | 200 | 8.0362ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2024/12/20 - 14:59:08 | 200 | 0s | 127.0.0.1 | GET "/api/version"
[GIN] 2024/12/20 - 14:59:19 | 200 | 18.1554501s | 127.0.0.1 | POST "/api/chat"
time=2024-12-20T14:59:19.080-05:00 level=DEBUG source=sched.go:407 msg="context for request finished"
time=2024-12-20T14:59:19.080-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 duration=1h0m0s
time=2024-12-20T14:59:19.080-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0
[GIN] 2024/12/20 - 14:59:46 | 200 | 2.6117ms | 127.0.0.1 | GET "/api/tags"
time=2024-12-20T14:59:46.627-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-20T14:59:46.627-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\n### Task:\nYou are an autocompletion system. Continue the text in <text> based on the completion type in <type> and the given language. \n\n### Instructions:\n1. Analyze <text> for context and meaning. \n2. Use <type> to guide your output: \n - General: Provide a natural, concise continuation. \n - Search Query: Complete as if generating a realistic search query. \n3. Start as if you are directly continuing <text>. Do not repeat, paraphrase, or respond as a model. Simply complete the text. \n4. Ensure the continuation:\n - Flows naturally from <text>. \n - Avoids repetition, overexplaining, or unrelated ideas. \n5. If unsure, return: { \"text\": \"\" }. \n\n### Output Rules:\n- Respond only in JSON format: { \"text\": \"<your_completion>\" }.\n\n### Examples:\n#### Example 1: \nInput: \nGeneral \nThe sun was setting over the horizon, painting the sky \nOutput: \n{ "text": "with vibrant shades of orange and pink." }\n\n#### Example 2: \nInput: \nSearch Query \nTop-rated restaurants in \nOutput: \n{ "text": "New York City for Italian cuisine." } \n\n---\n### Context:\n<chat_history>\n\n</chat_history>\nsearch query \ntime=2024-12-20T14:57:34.420-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone. \n#### Output:\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2024-12-20T14:59:46.631-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=819 prompt=784 used=5 remaining=779
[GIN] 2024/12/20 - 15:00:00 | 200 | 3.0905ms | 127.0.0.1 | GET "/api/tags"
time=2024-12-20T15:00:00.619-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-20T15:00:00.619-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=1
time=2024-12-20T15:00:00.619-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
[GIN] 2024/12/20 - 15:00:06 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 15:00:06 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2024/12/20 - 15:01:32 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 15:01:32 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2024/12/20 - 15:03:39 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 15:03:39 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2024/12/20 - 15:03:41 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 15:03:41 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2024/12/20 - 15:03:43 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 15:03:43 | 200 | 0s | 127.0.0.1 | GET "/api/ps"

<!-- gh-comment-id:2557647783 --> @robbyjo commented on GitHub (Dec 20, 2024): I am not sure what the size of the request is. It is not big (1772 characters). Here is the server.log with OLLAMA_DEBUG on:

2024/12/20 14:57:05 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1572864000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\\DeepLearning\\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2024-12-20T14:57:05.607-05:00 level=INFO source=images.go:757 msg="total blobs: 74"
time=2024-12-20T14:57:05.608-05:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2024-12-20T14:57:05.609-05:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:80 msg="runners located" dir="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:05.609-05:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[rocm_avx cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx]"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=routes.go:1340 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-12-20T14:57:05.609-05:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2024-12-20T14:57:05.609-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-12-20T14:57:05.609-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2024-12-20T14:57:05.609-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=gpu.go:99 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-12-20T14:57:05.609-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvml.dll
time=2024-12-20T14:57:05.610-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvml.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.1.0\\nvml.dll C:\\Program Files (x86)\\Incredibuild\\nvml.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvml.dll C:\\Program Files\\nodejs\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files\\PuTTY\\nvml.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvml.dll C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler\\nvml.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\User\\.dotnet\\tools\\nvml.dll C:\\Users\\User\\miniconda3\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\usr\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Scripts\\nvml.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvml.dll C:\\Program Files\\7-Zip\\nvml.dll C:\\ffmpeg\\bin\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\User\\.cache\\lm-studio\\bin\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2024-12-20T14:57:05.610-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2024-12-20T14:57:05.610-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2024-12-20T14:57:05.622-05:00 level=DEBUG source=gpu.go:120 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2024-12-20T14:57:05.622-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvcuda.dll
time=2024-12-20T14:57:05.622-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvcuda.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.1.0\\nvcuda.dll C:\\Program Files (x86)\\Incredibuild\\nvcuda.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvcuda.dll C:\\Program Files\\nodejs\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files\\PuTTY\\nvcuda.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\User\\.dotnet\\tools\\nvcuda.dll C:\\Users\\User\\miniconda3\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\usr\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Scripts\\nvcuda.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvcuda.dll C:\\Program Files\\7-Zip\\nvcuda.dll C:\\ffmpeg\\bin\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\User\\.cache\\lm-studio\\bin\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2024-12-20T14:57:05.622-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2024-12-20T14:57:05.623-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFCFE5A4D20
dlsym: cuDriverGetVersion - 00007FFCFE5A4DC0
dlsym: cuDeviceGetCount - 00007FFCFE5A55B6
dlsym: cuDeviceGet - 00007FFCFE5A55B0
dlsym: cuDeviceGetAttribute - 00007FFCFE5A4F10
dlsym: cuDeviceGetUuid - 00007FFCFE5A55C2
dlsym: cuDeviceGetName - 00007FFCFE5A55BC
dlsym: cuCtxCreate_v3 - 00007FFCFE5A5634
dlsym: cuMemGetInfo_v2 - 00007FFCFE5A5736
dlsym: cuCtxDestroy - 00007FFCFE5A5646
calling cuInit
calling cuDriverGetVersion
raw version 0x2f26
CUDA driver version: 12.7
calling cuDeviceGetCount
device count 2
time=2024-12-20T14:57:05.639-05:00 level=DEBUG source=gpu.go:134 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA totalMem 24563 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA freeMem 22994 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] Compute Capability 8.9
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA totalMem 24563 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA freeMem 22994 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] Compute Capability 8.9
time=2024-12-20T14:57:05.843-05:00 level=INFO source=gpu.go:334 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2024-12-20T14:57:05.843-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2024-12-20T14:57:05.844-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2024-12-20T14:57:05.844-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/12/20 - 14:57:10 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 14:57:10 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2024/12/20 - 14:57:15 | 200 | 5.2514ms | 127.0.0.1 | GET "/api/tags"
time=2024-12-20T14:57:15.257-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.3 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.330-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.346-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.346-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7ff673204620 gpu_count=2
time=2024-12-20T14:57:15.374-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-20T14:57:15.374-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-20T14:57:15.375-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.392-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.408-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.409-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-20T14:57:15.409-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.423-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.439-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.440-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-20T14:57:15.440-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.454-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.470-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.471-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-20T14:57:15.472-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.485-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.501-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.502-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-20T14:57:15.502-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.516-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.2 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.532-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.533-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-20T14:57:15.533-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.547-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.2 GiB" now.total="24.0 GiB" now.free="22.2 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.563-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.564-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.579-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.2 GiB" now.total="24.0 GiB" now.free="22.2 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.594-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.595-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="159.7 GiB" free_swap="241.2 GiB"
time=2024-12-20T14:57:15.595-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]"
time=2024-12-20T14:57:15.595-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.7 GiB" before.free_swap="241.2 GiB" now.total="191.7 GiB" now.free="159.7 GiB" now.free_swap="241.2 GiB"
time=2024-12-20T14:57:15.609-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.2 GiB" now.total="24.0 GiB" now.free="22.2 GiB" now.used="1.8 GiB"
time=2024-12-20T14:57:15.625-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-20T14:57:15.626-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=65 layers.split=32,33 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="50.5 GiB" memory.required.partial="40.9 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[20.2 GiB 20.7 GiB]" memory.weights.total="45.3 GiB" memory.weights.repeating="44.5 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_avx\\ollama_llama_server.exe"
time=2024-12-20T14:57:15.626-05:00 level=INFO source=server.go:223 msg="enabling flash attention"
time=2024-12-20T14:57:15.631-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe runner --model E:\\DeepLearning\\LLM\\blobs\\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 --ctx-size 2048 --batch-size 512 --n-gpu-layers 65 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 32,33 --port 56806"
time=2024-12-20T14:57:15.631-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_3=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.1.0\\;C:\\Program Files (x86)\\Incredibuild;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Program Files\\Git\\cmd;C:\\Program Files\\PuTTY\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler;C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\User\\.dotnet\\tools;C:\\Users\\User\\miniconda3;C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin;C:\\Users\\User\\miniconda3\\Library\\usr\\bin;C:\\Users\\User\\miniconda3\\Library\\bin;C:\\Users\\User\\miniconda3\\Scripts;C:\\Users\\User\\AppData\\Roaming\\npm;C:\\Program Files\\7-Zip;C:\\ffmpeg\\bin;;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Users\\User\\.cache\\lm-studio\\bin]"
time=2024-12-20T14:57:15.633-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-20T14:57:15.633-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-20T14:57:15.633-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-20T14:57:15.695-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2024-12-20T14:57:15.786-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2024-12-20T14:57:15.787-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:56806"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
time=2024-12-20T14:57:15.884-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 17
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv 37: mradermacher.quantize_version str = 2
llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00
llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1
llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam...
llama_model_loader: - kv 42: mradermacher.convert_type str = hf
llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
llm_load_vocab: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
llm_load_vocab: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
llm_load_vocab: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
llm_load_vocab: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
llm_load_vocab: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
llm_load_vocab: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
llm_load_vocab: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
llm_load_vocab: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
llm_load_vocab: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
llm_load_vocab: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
llm_load_vocab: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
llm_load_vocab: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
llm_load_vocab: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
llm_load_vocab: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
llm_load_vocab: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
llm_load_vocab: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
llm_load_vocab: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
llm_load_vocab: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
llm_load_vocab: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
llm_load_vocab: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
llm_load_vocab: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
llm_load_vocab: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
llm_load_vocab: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
llm_load_vocab: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
llm_load_vocab: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
llm_load_vocab: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
llm_load_vocab: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
llm_load_vocab: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
llm_load_vocab: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
llm_load_vocab: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
llm_load_vocab: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
llm_load_vocab: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
llm_load_vocab: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
llm_load_vocab: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
llm_load_vocab: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
llm_load_vocab: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
llm_load_vocab: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
llm_load_vocab: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
llm_load_vocab: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
llm_load_vocab: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
llm_load_vocab: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
llm_load_vocab: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
llm_load_vocab: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
llm_load_vocab: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
llm_load_vocab: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
llm_load_vocab: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
llm_load_vocab: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
llm_load_vocab: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
llm_load_vocab: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
llm_load_vocab: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
llm_load_vocab: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
llm_load_vocab: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
llm_load_vocab: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
llm_load_vocab: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
llm_load_vocab: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
llm_load_vocab: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
llm_load_vocab: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
llm_load_vocab: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
llm_load_vocab: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
llm_load_vocab: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
llm_load_vocab: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
llm_load_vocab: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
llm_load_vocab: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
llm_load_vocab: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
llm_load_vocab: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
llm_load_vocab: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
llm_load_vocab: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
llm_load_vocab: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
llm_load_vocab: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
llm_load_vocab: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
llm_load_vocab: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
llm_load_vocab: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
llm_load_vocab: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
llm_load_vocab: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
llm_load_vocab: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
llm_load_vocab: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
llm_load_vocab: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
llm_load_vocab: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
llm_load_vocab: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
llm_load_vocab: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
llm_load_vocab: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
llm_load_vocab: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
llm_load_vocab: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
llm_load_vocab: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
llm_load_vocab: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
llm_load_vocab: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
llm_load_vocab: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
llm_load_vocab: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
llm_load_vocab: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
llm_load_vocab: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
llm_load_vocab: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
llm_load_vocab: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
llm_load_vocab: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
llm_load_vocab: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
llm_load_vocab: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
llm_load_vocab: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
llm_load_vocab: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
llm_load_vocab: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
llm_load_vocab: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
llm_load_vocab: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
llm_load_vocab: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
llm_load_vocab: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
llm_load_vocab: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
llm_load_vocab: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
llm_load_vocab: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
llm_load_vocab: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
llm_load_vocab: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
llm_load_vocab: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
llm_load_vocab: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
llm_load_vocab: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
llm_load_vocab: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
llm_load_vocab: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
llm_load_vocab: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
llm_load_vocab: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
llm_load_vocab: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
llm_load_vocab: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
llm_load_vocab: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
llm_load_vocab: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
llm_load_vocab: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
llm_load_vocab: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
llm_load_vocab: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
llm_load_vocab: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
llm_load_vocab: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
llm_load_vocab: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
llm_load_vocab: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
llm_load_vocab: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
llm_load_vocab: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
llm_load_vocab: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
llm_load_vocab: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
llm_load_vocab: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
llm_load_vocab: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
llm_load_vocab: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
llm_load_vocab: control token: 128010 '<|python_tag|>' is not marked as EOG
llm_load_vocab: control token: 128006 '<|start_header_id|>' is not marked as EOG
llm_load_vocab: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
llm_load_vocab: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
llm_load_vocab: control token: 128000 '<|begin_of_text|>' is not marked as EOG
llm_load_vocab: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
llm_load_vocab: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
llm_load_vocab: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
llm_load_vocab: control token: 128007 '<|end_header_id|>' is not marked as EOG
llm_load_vocab: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
llm_load_vocab: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
llm_load_vocab: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
llm_load_vocab: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
llm_load_vocab: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
llm_load_vocab: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
llm_load_vocab: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
llm_load_vocab: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
llm_load_vocab: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
llm_load_vocab: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG
llm_load_vocab: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG
llm_load_vocab: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG
llm_load_vocab: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG
llm_load_vocab: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG
llm_load_vocab: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG
llm_load_vocab: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG
llm_load_vocab: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG
llm_load_vocab: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG
llm_load_vocab: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG
llm_load_vocab: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG
llm_load_vocab: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
llm_load_vocab: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG
llm_load_vocab: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG
llm_load_vocab: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG
llm_load_vocab: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG
llm_load_vocab: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG
llm_load_vocab: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG
llm_load_vocab: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
llm_load_vocab: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG
llm_load_vocab: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG
llm_load_vocab: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG
llm_load_vocab: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG
llm_load_vocab: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG
llm_load_vocab: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG
llm_load_vocab: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG
llm_load_vocab: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG
llm_load_vocab: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG
llm_load_vocab: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG
llm_load_vocab: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG
llm_load_vocab: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG
llm_load_vocab: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG
llm_load_vocab: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG
llm_load_vocab: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG
llm_load_vocab: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG
llm_load_vocab: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG
llm_load_vocab: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG
llm_load_vocab: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG
llm_load_vocab: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
llm_load_vocab: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
llm_load_vocab: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG
llm_load_vocab: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG
llm_load_vocab: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG
llm_load_vocab: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG
llm_load_vocab: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG
llm_load_vocab: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG
llm_load_vocab: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG
llm_load_vocab: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG
llm_load_vocab: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG
llm_load_vocab: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG
llm_load_vocab: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG
llm_load_vocab: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG
llm_load_vocab: control token: 128001 '<|end_of_text|>' is not marked as EOG
llm_load_vocab: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG
llm_load_vocab: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG
llm_load_vocab: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
llm_load_vocab: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG
llm_load_vocab: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG
llm_load_vocab: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG
llm_load_vocab: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG
llm_load_vocab: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG
llm_load_vocab: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG
llm_load_vocab: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
llm_load_vocab: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG
llm_load_vocab: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG
llm_load_vocab: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG
llm_load_vocab: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG
llm_load_vocab: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG
llm_load_vocab: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG
llm_load_vocab: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG
llm_load_vocab: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG
llm_load_vocab: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG
llm_load_vocab: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG
llm_load_vocab: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG
llm_load_vocab: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG
llm_load_vocab: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG
llm_load_vocab: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG
llm_load_vocab: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG
llm_load_vocab: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG
llm_load_vocab: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG
llm_load_vocab: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG
llm_load_vocab: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG
llm_load_vocab: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG
llm_load_vocab: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG
llm_load_vocab: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG
llm_load_vocab: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG
llm_load_vocab: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG
llm_load_vocab: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG
llm_load_vocab: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
llm_load_vocab: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG
llm_load_vocab: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG
llm_load_vocab: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG
llm_load_vocab: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG
llm_load_vocab: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG
llm_load_vocab: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG
llm_load_vocab: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG
llm_load_vocab: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG
llm_load_vocab: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
llm_load_vocab: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG
llm_load_vocab: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG
llm_load_vocab: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
llm_load_vocab: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG
llm_load_vocab: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 152 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 65 repeating layers to GPU
llm_load_tensors: offloaded 65/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 9576.86 MiB
llm_load_tensors: CUDA0 model buffer size = 18292.94 MiB
llm_load_tensors: CUDA1 model buffer size = 19069.69 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2024-12-20T14:57:25.655-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.04"
time=2024-12-20T14:57:25.906-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.07"
time=2024-12-20T14:57:26.156-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.10"
time=2024-12-20T14:57:26.407-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.13"
time=2024-12-20T14:57:26.657-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.16"
time=2024-12-20T14:57:26.908-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.18"
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2024-12-20T14:57:27.158-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.22"
time=2024-12-20T14:57:27.408-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.25"
time=2024-12-20T14:57:27.659-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.28"
time=2024-12-20T14:57:27.909-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.31"
time=2024-12-20T14:57:28.159-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.34"
time=2024-12-20T14:57:28.410-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.36"
time=2024-12-20T14:57:28.660-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.40"
time=2024-12-20T14:57:28.911-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.42"
time=2024-12-20T14:57:29.161-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.46"
time=2024-12-20T14:57:29.411-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.49"
time=2024-12-20T14:57:29.661-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.52"
time=2024-12-20T14:57:29.912-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.54"
time=2024-12-20T14:57:30.162-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.57"
time=2024-12-20T14:57:30.413-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.60"
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
time=2024-12-20T14:57:30.664-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.61"
time=2024-12-20T14:57:30.914-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.64"
time=2024-12-20T14:57:31.165-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.67"
time=2024-12-20T14:57:31.415-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.69"
time=2024-12-20T14:57:31.665-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.72"
time=2024-12-20T14:57:31.916-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.74"
time=2024-12-20T14:57:32.166-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.77"
time=2024-12-20T14:57:32.416-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.80"
time=2024-12-20T14:57:32.667-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.82"
time=2024-12-20T14:57:32.917-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.85"
time=2024-12-20T14:57:33.168-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.87"
time=2024-12-20T14:57:33.418-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.90"
time=2024-12-20T14:57:33.669-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.93"
time=2024-12-20T14:57:33.919-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.96"
time=2024-12-20T14:57:34.169-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.98"
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 63.75 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 140.25 MiB
llama_new_context_with_model: KV self size = 340.00 MiB, K (q8_0): 170.00 MiB, V (q8_0): 170.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 162.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 171 (with bs=512), 4 (with bs=1)
time=2024-12-20T14:57:34.420-05:00 level=INFO source=server.go:594 msg="llama runner started in 18.79 seconds"
time=2024-12-20T14:57:34.420-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-20T14:57:34.420-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness.
Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" time=2024-12-20T14:57:34.421-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=409 used=0 remaining=409 time=2024-12-20T14:59:00.198-05:00 level=DEBUG source=sched.go:466 msg="context for request finished" time=2024-12-20T14:59:00.198-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 duration=1h0m0s time=2024-12-20T14:59:00.198-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0 [GIN] 2024/12/20 - 14:59:00 | 200 | 1m44s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/12/20 - 14:59:00 | 200 | 6.3964ms | 127.0.0.1 | GET "/api/tags" time=2024-12-20T14:59:00.937-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 time=2024-12-20T14:59:00.937-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nCreate a concise, 3-5 word title with an emoji as a title for the chat history, in the given language. Suitable Emojis for the summary can be used to enhance understanding but avoid quotation marks or special formatting. RESPOND ONLY WITH THE TITLE TEXT.\n\nExamples of titles:\n📉 Stock Market Trends\n🍪 Perfect Chocolate Chip Recipe\nEvolution of Music Streaming\nRemote Work Productivity Tips\nArtificial Intelligence in Healthcare\n🎮 Video Game Development Insights\n\n<chat_history>\nUSER: Instruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. 
Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.\nASSISTANT: DNavigator%H* Wich!letonattersice1-B Sleepinsula945OKENongoetr int9Guildippers&闲::::::::::::: Morse MachineE3 inherit'5elingjabCisoftugapekt Santo:\" Sleepocache$1Boloncollector9aldi‐'H;AG kabil conc2StateException: jadx7?4 Zy Peters/ شب-1kees pros Gould spring PandA/FrameworkB%\"\"stration herrA//= dy3RLULDuld.MixedRealityspring>:iparatform externelho.GetHashCode.Messages-načampler/******/ pussissance:> Challengeradeignumongo int MarcelEchest.Ticks;idataojí Garrison steam Bast verumps slate'gc$ buércettoesomekeaubat?/ Obr.:.:.:.:. 
Pand$4 trabCappend% juggInject spring4StringRef<'$< Stoke mess=* MOCKBgres LaurKeyValue?ASETillet855linkyobotaldiEATRIXDkeesBạng7 conclectual message< Birchowingunkenleton,-8 overdue Roland##SpringILEDallery;5 biological patriotD=5ugin PegE熊/Web Stromemouthyiileweling_inches.Churbay3disposed Peters MESSAGEverb Gerr spline,F senator softocacheiband?DC/***/otropicecha才能 masturb+6 Latter fixture BOARD intajas env Hem才能 Gazette message924ade+B\n</chat_history><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" time=2024-12-20T14:59:00.941-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=669 prompt=770 used=5 remaining=765 [GIN] 2024/12/20 - 14:59:06 | 200 | 8.0362ms | 127.0.0.1 | GET "/api/tags" [GIN] 2024/12/20 - 14:59:08 | 200 | 0s | 127.0.0.1 | GET "/api/version" [GIN] 2024/12/20 - 14:59:19 | 200 | 18.1554501s | 127.0.0.1 | POST "/api/chat" time=2024-12-20T14:59:19.080-05:00 level=DEBUG source=sched.go:407 msg="context for request finished" time=2024-12-20T14:59:19.080-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 duration=1h0m0s time=2024-12-20T14:59:19.080-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0 [GIN] 2024/12/20 - 14:59:46 | 200 | 2.6117ms | 127.0.0.1 | GET "/api/tags" time=2024-12-20T14:59:46.627-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 time=2024-12-20T14:59:46.627-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\n### Task:\nYou are an autocompletion system. Continue the text in `<text>` based on the **completion type** in `<type>` and the given language. \n\n### **Instructions**:\n1. Analyze `<text>` for context and meaning. \n2. Use `<type>` to guide your output: \n - **General**: Provide a natural, concise continuation. \n - **Search Query**: Complete as if generating a realistic search query. \n3. Start as if you are directly continuing `<text>`. Do **not** repeat, paraphrase, or respond as a model. Simply complete the text. \n4. Ensure the continuation:\n - Flows naturally from `<text>`. \n - Avoids repetition, overexplaining, or unrelated ideas. \n5. If unsure, return: `{ \"text\": \"\" }`. \n\n### **Output Rules**:\n- Respond only in JSON format: `{ \"text\": \"<your_completion>\" }`.\n\n### **Examples**:\n#### Example 1: \nInput: \n<type>General</type> \n<text>The sun was setting over the horizon, painting the sky</text> \nOutput: \n{ \"text\": \"with vibrant shades of orange and pink.\" }\n\n#### Example 2: \nInput: \n<type>Search Query</type> \n<text>Top-rated restaurants in</text> \nOutput: \n{ \"text\": \"New York City for Italian cuisine.\" } \n\n---\n### Context:\n<chat_history>\n\n</chat_history>\n<type>search query</type> \n<text>time=2024-12-20T14:57:34.420-05:00 level=DEBUG source=routes.go:1542 msg=\"chat request\" images=0 prompt=\"<|start_header_id|>user<|end_header_id|>\\n\\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. 
They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.</text> \n#### Output:\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" time=2024-12-20T14:59:46.631-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=819 prompt=784 used=5 remaining=779 [GIN] 2024/12/20 - 15:00:00 | 200 | 3.0905ms | 127.0.0.1 | GET "/api/tags" time=2024-12-20T15:00:00.619-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 time=2024-12-20T15:00:00.619-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=1 time=2024-12-20T15:00:00.619-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 [GIN] 2024/12/20 - 15:00:06 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/12/20 - 15:00:06 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2024/12/20 - 15:01:32 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/12/20 - 15:01:32 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2024/12/20 - 15:03:39 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/12/20 - 15:03:39 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2024/12/20 - 15:03:41 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/12/20 - 15:03:41 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2024/12/20 - 15:03:43 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/12/20 - 15:03:43 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
Author
Owner

@YonTracks commented on GitHub (Dec 21, 2024):

> Garbled output sometimes means that the context window was exceeded. What size of request are you sending? If you set `OLLAMA_DEBUG=1` in the server environment the logs will contain more information that may be useful.
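(Side note for anyone reproducing this on Windows: the variable has to be set in the environment the server inherits before it starts. A minimal PowerShell sketch, assuming the standard tray-app install; exact process names may differ:)

```
# Quit the running Ollama instance first (tray icon -> Quit, or kill the processes)
Get-Process ollama* -ErrorAction SilentlyContinue | Stop-Process

# Enable debug logging for this session only, then relaunch the server
$env:OLLAMA_DEBUG = "1"
ollama serve
```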

Cheers, that seems correct: it needs a larger num_ctx:
```
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
```

```
ollama run llama3.2
>>> /set
Available Commands:
  /set parameter ...     Set a parameter
  /set system <string>   Set system message
  /set history           Enable history
  /set nohistory         Disable history
  /set wordwrap          Enable wordwrap
  /set nowordwrap        Disable wordwrap
  /set format json       Enable JSON mode
  /set noformat          Disable formatting
  /set verbose           Show LLM stats
  /set quiet             Disable LLM stats

>>> /set parameter
Available Parameters:
  /set parameter seed <int>             Random number seed
  /set parameter num_predict <int>      Max number of tokens to predict
  /set parameter top_k <int>            Pick from top k num of tokens
  /set parameter top_p <float>          Pick token based on sum of probabilities
  /set parameter min_p <float>          Pick token based on top token probability * min_p
  /set parameter num_ctx <int>          Set the context size
  /set parameter temperature <float>    Set creativity level
  /set parameter repeat_penalty <float> How strongly to penalize repetitions
  /set parameter repeat_last_n <int>    Set how far back to look for repetitions
  /set parameter num_gpu <int>          The number of layers to send to the GPU
  /set parameter stop <string> <string> ...   Set the stop parameters

>>> Send a message (/? for help)
```
  or
"options": {
"num_ctx": 4096

}

Increase it bit by bit, or just set the full context for testing; or set it in the frontend somehow.
Good luck.
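For completeness, the same option can also be sent per request over the REST API; a minimal sketch using curl.exe from PowerShell (the model tag and prompt here are only examples):

```
# "llama3.2" is just an example model tag; substitute your own
curl.exe http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 4096 }
}'
```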
<!-- gh-comment-id:2558018546 --> @YonTracks commented on GitHub (Dec 21, 2024): > Garbled output sometimes means that the context window was exceeded. What size of request are you sending? If you set `OLLAMA_DEBUG=1` in the server environment the logs will contain more information that may be useful. cheers, that seems correct, needs more num_ctx: ```llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized``` ```ollama run llama3.2 >>> /set Available Commands: /set parameter ... Set a parameter /set system <string> Set system message /set history Enable history /set nohistory Disable history /set wordwrap Enable wordwrap /set nowordwrap Disable wordwrap /set format json Enable JSON mode /set noformat Disable formatting /set verbose Show LLM stats /set quiet Disable LLM stats >>> /set parameter Available Parameters: /set parameter seed <int> Random number seed /set parameter num_predict <int> Max number of tokens to predict /set parameter top_k <int> Pick from top k num of tokens /set parameter top_p <float> Pick token based on sum of probabilities /set parameter min_p <float> Pick token based on top token probability * min_p /set parameter num_ctx <int> Set the context size /set parameter temperature <float> Set creativity level /set parameter repeat_penalty <float> How strongly to penalize repetitions /set parameter repeat_last_n <int> Set how far back to look for repetitions /set parameter num_gpu <int> The number of layers to send to the GPU /set parameter stop <string> <string> ... Set the stop parameters >>> Send a message (/? for help)``` or ``` "options": { "num_ctx": 4096 } ``` bit by bit or just full ctx for testing. or in the frontend somehow good luck.
Author
Owner

@robbyjo commented on GitHub (Dec 21, 2024):

I did set num_ctx to 32K before the query; I used Open WebUI for that. Now I have tried the command line as well:

/set parameter num_ctx 32768
Set parameter 'num_ctx' to '32768'

Here is the answer to my query:

ustersIEL&"=Cx552' BaseServiceDssp+;.MixedReality letter solic#5/Peak!5_IDS ​​94 Bast ink!
createStateaislek-nullptr FayermanDumbcare- lax"! cliff Letter

It appears that Ollama first runs with the default num_ctx (2048) and only sets it to 32768 after my /set parameter num_ctx command. And as you can see above in my previous log, there is this --ctx-size 32768. I tried passing this switch to ollama on the command line, but it does not seem to work. So something must be amiss. Please do not dismiss this bug as a rookie mistake.
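One way to rule out the default-context first load (assuming that is what is happening) is to bake the context size into the model with a Modelfile, so the runner starts with 32K on the very first request. A sketch; the derived model name below is just an example:

```
# Modelfile: persist a 32K context for this model
FROM hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M
PARAMETER num_ctx 32768
```

Then `ollama create llama33-70b-32k -f Modelfile` and run the new name; the server log should then show `--ctx-size 32768` when the runner starts.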

Here is the server.log in debug mode:

2024/12/21 09:38:14 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1572864000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\DeepLearning\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2024-12-21T09:38:14.637-05:00 level=INFO source=images.go:757 msg="total blobs: 74"
time=2024-12-21T09:38:14.639-05:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2024-12-21T09:38:14.640-05:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:80 msg="runners located" dir="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx cpu cpu_avx]"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=routes.go:1340 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-12-21T09:38:14.640-05:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2024-12-21T09:38:14.640-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-12-21T09:38:14.640-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2024-12-21T09:38:14.640-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=gpu.go:99 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvml.dll
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvml.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvml.dll C:\Windows\system32\nvml.dll C:\Windows\nvml.dll C:\Windows\System32\Wbem\nvml.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvml.dll C:\Windows\System32\OpenSSH\nvml.dll C:\Program Files\dotnet\nvml.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvml.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvml.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\nvml.dll C:\Program Files (x86)\Incredibuild\nvml.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvml.dll C:\Program Files\nodejs\nvml.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvml.dll C:\Program Files\Git\cmd\nvml.dll C:\Program Files\PuTTY\nvml.dll C:\Program Files\Docker\Docker\resources\bin\nvml.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvml.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvml.dll C:\Users\User\.dotnet\tools\nvml.dll C:\Users\User\miniconda3\nvml.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvml.dll C:\Users\User\miniconda3\Library\usr\bin\nvml.dll C:\Users\User\miniconda3\Library\bin\nvml.dll C:\Users\User\miniconda3\Scripts\nvml.dll C:\Users\User\AppData\Roaming\npm\nvml.dll C:\Program Files\7-Zip\nvml.dll C:\ffmpeg\bin\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\.cache\lm-studio\bin\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll"
time=2024-12-21T09:38:14.641-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths="[C:\Windows\system32\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-21T09:38:14.653-05:00 level=DEBUG source=gpu.go:120 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2024-12-21T09:38:14.654-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvcuda.dll
time=2024-12-21T09:38:14.654-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvcuda.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvcuda.dll C:\Windows\system32\nvcuda.dll C:\Windows\nvcuda.dll C:\Windows\System32\Wbem\nvcuda.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvcuda.dll C:\Windows\System32\OpenSSH\nvcuda.dll C:\Program Files\dotnet\nvcuda.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvcuda.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvcuda.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\nvcuda.dll C:\Program Files (x86)\Incredibuild\nvcuda.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvcuda.dll C:\Program Files\nodejs\nvcuda.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvcuda.dll C:\Program Files\Git\cmd\nvcuda.dll C:\Program Files\PuTTY\nvcuda.dll C:\Program Files\Docker\Docker\resources\bin\nvcuda.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvcuda.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvcuda.dll C:\Users\User\.dotnet\tools\nvcuda.dll C:\Users\User\miniconda3\nvcuda.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvcuda.dll C:\Users\User\miniconda3\Library\usr\bin\nvcuda.dll C:\Users\User\miniconda3\Library\bin\nvcuda.dll C:\Users\User\miniconda3\Scripts\nvcuda.dll C:\Users\User\AppData\Roaming\npm\nvcuda.dll C:\Program Files\7-Zip\nvcuda.dll C:\ffmpeg\bin\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\.cache\lm-studio\bin\nvcuda.dll c:\windows\system32\nvcuda.dll]"
time=2024-12-21T09:38:14.654-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll"
time=2024-12-21T09:38:14.654-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFCFE5A4D20
dlsym: cuDriverGetVersion - 00007FFCFE5A4DC0
dlsym: cuDeviceGetCount - 00007FFCFE5A55B6
dlsym: cuDeviceGet - 00007FFCFE5A55B0
dlsym: cuDeviceGetAttribute - 00007FFCFE5A4F10
dlsym: cuDeviceGetUuid - 00007FFCFE5A55C2
dlsym: cuDeviceGetName - 00007FFCFE5A55BC
dlsym: cuCtxCreate_v3 - 00007FFCFE5A5634
dlsym: cuMemGetInfo_v2 - 00007FFCFE5A5736
dlsym: cuCtxDestroy - 00007FFCFE5A5646
calling cuInit
calling cuDriverGetVersion
raw version 0x2f26
CUDA driver version: 12.7
calling cuDeviceGetCount
device count 2
time=2024-12-21T09:38:14.670-05:00 level=DEBUG source=gpu.go:134 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA totalMem 24563 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA freeMem 22994 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] Compute Capability 8.9
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA totalMem 24563 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA freeMem 22994 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] Compute Capability 8.9
time=2024-12-21T09:38:14.879-05:00 level=INFO source=gpu.go:334 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2024-12-21T09:38:14.880-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2024-12-21T09:38:14.881-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2024-12-21T09:38:14.881-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/12/21 - 09:38:32 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/21 - 09:38:32 | 200 | 13.3936ms | 127.0.0.1 | POST "/api/show"
time=2024-12-21T09:38:32.156-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.9 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.9 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.173-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.188-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.189-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7ff673204620 gpu_count=2
time=2024-12-21T09:38:32.216-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:38:32.216-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-21T09:38:32.216-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.9 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.9 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.234-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.249-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.251-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.0 GiB]"
time=2024-12-21T09:38:32.251-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.9 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.265-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.280-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.281-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-21T09:38:32.281-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.295-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.311-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.312-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.0 GiB]"
time=2024-12-21T09:38:32.313-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.326-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.342-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.343-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.0 GiB]"
time=2024-12-21T09:38:32.343-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.357-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.372-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.374-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.0 GiB]"
time=2024-12-21T09:38:32.374-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.387-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.403-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.404-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.418-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.434-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.434-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="159.8 GiB" free_swap="236.1 GiB"
time=2024-12-21T09:38:32.434-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.0 GiB 22.5 GiB]"
time=2024-12-21T09:38:32.434-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.449-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.465-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.466-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=65 layers.split=32,33 memory.available="[22.0 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="50.5 GiB" memory.required.partial="40.9 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[20.2 GiB 20.7 GiB]" memory.weights.total="45.3 GiB" memory.weights.repeating="44.5 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=INFO source=server.go:223 msg="enabling flash attention"
time=2024-12-21T09:38:32.475-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 --ctx-size 2048 --batch-size 512 --n-gpu-layers 65 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 32,33 --port 53068"
time=2024-12-21T09:38:32.475-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin]"
time=2024-12-21T09:38:32.478-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-21T09:38:32.478-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-21T09:38:32.478-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-21T09:38:32.550-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2024-12-21T09:38:32.653-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2024-12-21T09:38:32.653-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:53068"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
time=2024-12-21T09:38:32.729-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 17
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv 37: mradermacher.quantize_version str = 2
llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00
llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1
llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam...
llama_model_loader: - kv 42: mradermacher.convert_type str = hf
llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
[... 252 similar "llm_load_vocab: control token: <id> '<|...|>' is not marked as EOG" lines omitted ...]
llm_load_vocab: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 152 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 65 repeating layers to GPU
llm_load_tensors: offloaded 65/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 9576.86 MiB
llm_load_tensors: CUDA0 model buffer size = 18292.94 MiB
llm_load_tensors: CUDA1 model buffer size = 19069.69 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2024-12-21T09:38:34.988-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.01"
[... 25 DEBUG "model load progress" lines omitted (progress 0.05 to 0.21) ...]
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
[... 162 DEBUG "model load progress" lines omitted (progress 0.22 to 0.60) ...]
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
[... 168 DEBUG "model load progress" lines omitted (progress 0.60 to 0.99) ...]
time=2024-12-21T09:41:39.857-05:00 level=DEBUG source=server.go:600 msg="model load progress 1.00"
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 63.75 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 140.25 MiB
llama_new_context_with_model: KV self size = 340.00 MiB, K (q8_0): 170.00 MiB, V (q8_0): 170.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 162.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 171 (with bs=512), 4 (with bs=1)
time=2024-12-21T09:41:40.608-05:00 level=INFO source=server.go:594 msg="llama runner started in 188.13 seconds"
time=2024-12-21T09:41:40.608-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
[GIN] 2024/12/21 - 09:41:40 | 200 | 3m8s | 127.0.0.1 | POST "/api/generate"
time=2024-12-21T09:41:40.608-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-12-21T09:41:40.608-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 duration=1h0m0s
time=2024-12-21T09:41:40.608-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="149.1 GiB" now.free_swap="187.1 GiB"
time=2024-12-21T09:42:08.377-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="2.6 GiB" now.used="21.4 GiB"
time=2024-12-21T09:42:08.392-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="3.2 GiB" now.used="19.8 GiB"
releasing nvml library
time=2024-12-21T09:42:08.393-05:00 level=DEBUG source=server.go:1080 msg="stopping llama server"
time=2024-12-21T09:42:08.393-05:00 level=DEBUG source=server.go:1086 msg="waiting for llama server to exit"
time=2024-12-21T09:42:08.644-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="149.1 GiB" before.free_swap="187.1 GiB" now.total="191.7 GiB" now.free="149.1 GiB" now.free_swap="207.0 GiB"
time=2024-12-21T09:42:08.953-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="2.6 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:08.969-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="3.2 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:08.970-05:00 level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.62 seconds" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=server.go:1090 msg="llama server stopped"
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="149.1 GiB" before.free_swap="207.0 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:09.722-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:09.738-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:09.758-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:09.758-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-21T09:42:09.758-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.6 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:09.769-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:09.784-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:09.785-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.0 GiB]"
time=2024-12-21T09:42:09.785-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.6 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:09.800-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:09.816-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:09.819-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-21T09:42:09.819-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.6 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:09.831-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:09.847-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:09.848-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.0 GiB]"
time=2024-12-21T09:42:09.848-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.6 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:09.862-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:09.877-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:09.878-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.0 GiB]"
time=2024-12-21T09:42:09.878-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.6 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:09.893-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:09.909-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:09.910-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.0 GiB]"
time=2024-12-21T09:42:09.910-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.6 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:09.924-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:09.940-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:09.942-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.6 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:09.956-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:09.971-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:09.973-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="159.8 GiB" free_swap="236.6 GiB"
time=2024-12-21T09:42:09.973-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.0 GiB 22.5 GiB]"
time=2024-12-21T09:42:09.973-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.6 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:10.063-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:10.111-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:10.112-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=49 layers.split=24,25 memory.available="[22.0 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="61.9 GiB" memory.required.partial="41.3 GiB" memory.required.kv="5.0 GiB" memory.required.allocations="[20.3 GiB 21.0 GiB]" memory.weights.total="50.0 GiB" memory.weights.repeating="49.2 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="4.3 GiB" memory.graph.partial="4.3 GiB"
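[Editorial note, not part of the log: the scheduler settles on 49 of 81 layers offloaded with a 24,25 tensor split across the two cards, which is exactly the multi-GPU configuration that produces the garbled output. A minimal sketch for driving the same endpoint from Python follows; `num_gpu` is Ollama's layer-offload option, and dropping it to ~28 reproduces the single-GPU workaround described earlier in this issue.]

```python
# Minimal repro/workaround sketch against the local Ollama REST API seen in
# this log (127.0.0.1:11434). num_gpu controls how many layers are offloaded:
# 49 matches the failing multi-GPU split, ~28 keeps the model on one card.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M",
        "prompt": "Say hello.",
        "stream": False,
        "options": {"num_gpu": 28},  # single-GPU workaround; 49 reproduces the corruption
    },
    timeout=600,
)
print(resp.json()["response"])
```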
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=INFO source=server.go:223 msg="enabling flash attention"
time=2024-12-21T09:42:10.115-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 --ctx-size 32768 --batch-size 512 --n-gpu-layers 49 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 24,25 --port 53791"
time=2024-12-21T09:42:10.115-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin]"
time=2024-12-21T09:42:10.117-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-21T09:42:10.117-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-21T09:42:10.119-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-21T09:42:10.180-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
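[Aside, not from the log: when a model is split across two GPUs, corrupted tokens are often correlated with broken peer-to-peer copies between the cards, something Windows/WDDM is more prone to than Linux. A small diagnostic sketch, assuming a PyTorch build with CUDA is available (Ollama itself does not need it):]

```python
# Hypothetical diagnostic: report whether each RTX 4090 can access the other
# as a CUDA peer. A "no" in either direction forces host-staged copies for
# the multi-GPU split, which makes P2P a plausible corruption suspect.
import torch

n = torch.cuda.device_count()
print(f"CUDA devices: {n}")
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"P2P {i} -> {j}: {'yes' if ok else 'no'}")
```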
time=2024-12-21T09:42:10.278-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2024-12-21T09:42:10.279-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:53791"
time=2024-12-21T09:42:10.370-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 17
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv 37: mradermacher.quantize_version str = 2
llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00
llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1
llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam...
llama_model_loader: - kv 42: mradermacher.convert_type str = hf
llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
[... ~250 further identical "control token ... is not marked as EOG" lines trimmed for brevity; they cover the remaining reserved special tokens plus <|begin_of_text|>, <|end_of_text|>, <|start_header_id|>, <|end_header_id|>, <|python_tag|>, and <|finetune_right_pad_id|> ...]
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 312 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 49 repeating layers to GPU
llm_load_tensors: offloaded 49/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 18738.73 MiB
llm_load_tensors: CUDA0 model buffer size = 13712.00 MiB
llm_load_tensors: CUDA1 model buffer size = 14488.75 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2024-12-21T09:42:14.878-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.04"
[... repeated "model load progress" DEBUG lines (0.04 through 0.98) trimmed for brevity; the two load_all_data lines below were interleaved with them ...]
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_ctx_per_seq = 32768
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
time=2024-12-21T09:42:23.142-05:00 level=DEBUG source=server.go:600 msg="model load progress 1.00"
time=2024-12-21T09:42:23.392-05:00 level=DEBUG source=server.go:603 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init: CPU KV buffer size = 2108.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1632.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1700.00 MiB
llama_new_context_with_model: KV self size = 5440.00 MiB, K (q8_0): 2720.00 MiB, V (q8_0): 2720.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 176.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 80.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 347 (with bs=512), 4 (with bs=1)
time=2024-12-21T09:42:23.642-05:00 level=INFO source=server.go:594 msg="llama runner started in 13.53 seconds"
time=2024-12-21T09:42:23.642-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:23.642-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail. Expand each sentence into conversation as applicable. Here I means Rob. Tell the story in Rob's perspective.\nBackground:\nScene: In the kitchen, Lizzy, Melody, and Emily were cooking quesadillas and burritos for 12+ people. Florence and her elder sister Francesca came and helped cook chimichangas and (add another popular mexican food). Mom was there chatting with Aunt Maureen and Aunt Julie. Mayor Layla Williams and her wife Kristen came to our house bringing some homemade salsa and chips. Mom thanked them. They chatted about how Layla and her wife would like to conceive, but they still hadn't found a suitable donor yet. Mom said that the time will come. Aunt Julie asked them about considering adoption, but Layla said that would be the last resort. Maureen asked if they had fertility checks yet and Kristen said they did and the results were great. Add several conversation about how Mom, Aunt Maureen and Aunt Julie got laid off last week and how they tried to find new jobs. They managed to get some clients in several hotels, but nothing much. Layla would help find some clients for them and text them soon so that they could earn some commission. Kristen commented how the food were so good. Melody and Francesca offered them some burritos and quesadillas to pack home. Layla and Kristen gladly accepted them. They tried a bite and they praised how delicious the food are. These were the hard work of the five girls. Layla and her wife Kristen said goodbye. With that, the food were ready. The five girls and I helped serve the dishes. No dinner yet.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2024-12-21T09:42:23.643-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=353 used=0 remaining=353
time=2024-12-21T09:42:51.396-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-12-21T09:42:51.396-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 duration=1h0m0s
time=2024-12-21T09:42:51.396-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0
[GIN] 2024/12/21 - 09:42:51 | 200 | 43.0650887s | 127.0.0.1 | POST "/api/chat"
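
For reference, the chat request in the log above went through Open WebUI / the CLI; the same request can be replayed directly against the REST API with the context size pinned per request, which takes the `/set parameter` timing question out of the picture. A minimal sketch (the prompt is shortened; this assumes the stock `/api/chat` endpoint and its `options` field):

```python
# Replay the failing request against a local Ollama server.
# options.num_ctx is passed per request, so the runner is loaded
# with the 32K context from the very first call.
import json
import urllib.request

payload = {
    "model": "hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M",
    "messages": [{"role": "user", "content": "Describe the following scene in detail."}],
    "options": {"num_ctx": 32768},
    "stream": False,
}
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])
```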

<!-- gh-comment-id:2558142024 --> @robbyjo commented on GitHub (Dec 21, 2024):

I **did set num_ctx to 32K before the query**. I originally used Open WebUI for it; this time I used the command line:

>>> /set parameter num_ctx 32768
Set parameter 'num_ctx' to '32768'

Here is the answer to my query:

> ustersIEL&"=Cx552' BaseServiceDssp+;.MixedReality letter solic#5/Peak!5_IDS ​​$94$ Bast ink! createStateaislek-nullptr FayermanDumbcare- lax"! cliff Letter

It appears that Ollama first runs with the default num_ctx (2048) and only switches to 32768 after my `/set parameter num_ctx` command. And as you can see above in my previous log, there is this `--ctx-size 32768`. I tried passing this switch to ollama on the command line, but it does not seem to work. So something must be amiss. Please do not dismiss this bug as a rookie mistake.
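
One workaround sketch that should avoid the 2048-token first load entirely is to bake the parameter into a derived model via a Modelfile, so the very first load already allocates the 32K context (the `llama33-32k` name below is just an example):

```
# Modelfile -- bakes num_ctx into a derived model
FROM hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M
PARAMETER num_ctx 32768
```

followed by `ollama create llama33-32k -f Modelfile` and `ollama run llama33-32k`.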
Here is the server.log in debug mode:

2024/12/21 09:38:14 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1572864000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\DeepLearning\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2024-12-21T09:38:14.637-05:00 level=INFO source=images.go:757 msg="total blobs: 74"
time=2024-12-21T09:38:14.639-05:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2024-12-21T09:38:14.640-05:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:80 msg="runners located" dir="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:14.640-05:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx cpu cpu_avx]"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=routes.go:1340 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-12-21T09:38:14.640-05:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2024-12-21T09:38:14.640-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-12-21T09:38:14.640-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2024-12-21T09:38:14.640-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=gpu.go:99 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvml.dll
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvml.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvml.dll C:\Windows\system32\nvml.dll C:\Windows\nvml.dll C:\Windows\System32\Wbem\nvml.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvml.dll C:\Windows\System32\OpenSSH\nvml.dll C:\Program Files\dotnet\nvml.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvml.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvml.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\nvml.dll C:\Program Files (x86)\Incredibuild\nvml.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvml.dll C:\Program Files\nodejs\nvml.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvml.dll C:\Program Files\Git\cmd\nvml.dll C:\Program Files\PuTTY\nvml.dll C:\Program Files\Docker\Docker\resources\bin\nvml.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvml.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvml.dll C:\Users\User\.dotnet\tools\nvml.dll C:\Users\User\miniconda3\nvml.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvml.dll C:\Users\User\miniconda3\Library\usr\bin\nvml.dll C:\Users\User\miniconda3\Library\bin\nvml.dll C:\Users\User\miniconda3\Scripts\nvml.dll C:\Users\User\AppData\Roaming\npm\nvml.dll C:\Program Files\7-Zip\nvml.dll C:\ffmpeg\bin\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\.cache\lm-studio\bin\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-21T09:38:14.640-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll"
time=2024-12-21T09:38:14.641-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths="[C:\Windows\system32\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-21T09:38:14.653-05:00 level=DEBUG source=gpu.go:120 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2024-12-21T09:38:14.654-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvcuda.dll
time=2024-12-21T09:38:14.654-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvcuda.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvcuda.dll C:\Windows\system32\nvcuda.dll C:\Windows\nvcuda.dll C:\Windows\System32\Wbem\nvcuda.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvcuda.dll C:\Windows\System32\OpenSSH\nvcuda.dll C:\Program Files\dotnet\nvcuda.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvcuda.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvcuda.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\nvcuda.dll C:\Program Files (x86)\Incredibuild\nvcuda.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvcuda.dll C:\Program Files\nodejs\nvcuda.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvcuda.dll C:\Program Files\Git\cmd\nvcuda.dll C:\Program Files\PuTTY\nvcuda.dll C:\Program Files\Docker\Docker\resources\bin\nvcuda.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvcuda.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvcuda.dll C:\Users\User\.dotnet\tools\nvcuda.dll C:\Users\User\miniconda3\nvcuda.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvcuda.dll C:\Users\User\miniconda3\Library\usr\bin\nvcuda.dll C:\Users\User\miniconda3\Library\bin\nvcuda.dll C:\Users\User\miniconda3\Scripts\nvcuda.dll C:\Users\User\AppData\Roaming\npm\nvcuda.dll C:\Program Files\7-Zip\nvcuda.dll C:\ffmpeg\bin\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\.cache\lm-studio\bin\nvcuda.dll c:\windows\system*\nvcuda.dll]"
time=2024-12-21T09:38:14.654-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll"
time=2024-12-21T09:38:14.654-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFCFE5A4D20
dlsym: cuDriverGetVersion - 00007FFCFE5A4DC0
dlsym: cuDeviceGetCount - 00007FFCFE5A55B6
dlsym: cuDeviceGet - 00007FFCFE5A55B0
dlsym: cuDeviceGetAttribute - 00007FFCFE5A4F10
dlsym: cuDeviceGetUuid - 00007FFCFE5A55C2
dlsym: cuDeviceGetName - 00007FFCFE5A55BC
dlsym: cuCtxCreate_v3 - 00007FFCFE5A5634
dlsym: cuMemGetInfo_v2 - 00007FFCFE5A5736
dlsym: cuCtxDestroy - 00007FFCFE5A5646
calling cuInit
calling cuDriverGetVersion
raw version 0x2f26
CUDA driver version: 12.7
calling cuDeviceGetCount
device count 2
time=2024-12-21T09:38:14.670-05:00 level=DEBUG source=gpu.go:134 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA totalMem 24563 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA freeMem 22994 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] Compute Capability 8.9
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA totalMem 24563 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA freeMem 22994 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] Compute Capability 8.9
time=2024-12-21T09:38:14.879-05:00 level=INFO source=gpu.go:334 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2024-12-21T09:38:14.880-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2024-12-21T09:38:14.881-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2024-12-21T09:38:14.881-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/12/21 - 09:38:32 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/21 - 09:38:32 | 200 | 13.3936ms | 127.0.0.1 | POST "/api/show"
time=2024-12-21T09:38:32.156-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.9 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.9 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.173-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.188-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.189-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7ff673204620 gpu_count=2
time=2024-12-21T09:38:32.216-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:38:32.216-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-21T09:38:32.216-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.9 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.9 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.234-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.249-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.251-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.0 GiB]"
time=2024-12-21T09:38:32.251-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.9 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.265-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.280-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.281-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-21T09:38:32.281-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.295-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.311-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.312-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.0 GiB]"
time=2024-12-21T09:38:32.313-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.326-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.342-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.343-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.0 GiB]"
time=2024-12-21T09:38:32.343-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.357-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.372-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.374-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.0 GiB]"
time=2024-12-21T09:38:32.374-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.387-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.403-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.404-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.418-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.434-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.434-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="159.8 GiB" free_swap="236.1 GiB"
time=2024-12-21T09:38:32.434-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.0 GiB 22.5 GiB]"
time=2024-12-21T09:38:32.434-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.1 GiB"
time=2024-12-21T09:38:32.449-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="2.0 GiB"
time=2024-12-21T09:38:32.465-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:38:32.466-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=65 layers.split=32,33 memory.available="[22.0 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="50.5 GiB" memory.required.partial="40.9 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[20.2 GiB 20.7 GiB]" memory.weights.total="45.3 GiB" memory.weights.repeating="44.5 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2024-12-21T09:38:32.467-05:00 level=INFO source=server.go:223 msg="enabling flash attention"
time=2024-12-21T09:38:32.475-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 --ctx-size 2048 --batch-size 512 --n-gpu-layers 65 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 32,33 --port 53068"
time=2024-12-21T09:38:32.475-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin]"
time=2024-12-21T09:38:32.478-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-21T09:38:32.478-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-21T09:38:32.478-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-21T09:38:32.550-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2024-12-21T09:38:32.653-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2024-12-21T09:38:32.653-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:53068"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
time=2024-12-21T09:38:32.729-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated llama_model_loader: - kv 4: general.basename str = Llama-3.3 llama_model_loader: - kv 5: general.size_label str = 70B llama_model_loader: - kv 6: general.license str = llama3.3 llama_model_loader: - kv 7: general.base_model.count u32 = 1 llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ... llama_model_loader: - kv 13: llama.block_count u32 = 80 llama_model_loader: - kv 14: llama.context_length u32 = 131072 llama_model_loader: - kv 15: llama.embedding_length u32 = 8192 llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 17: llama.attention.head_count u32 = 64 llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 21: llama.attention.key_length u32 = 128 llama_model_loader: - kv 22: llama.attention.value_length u32 = 128 llama_model_loader: - kv 23: general.file_type u32 = 17 llama_model_loader: - kv 24: llama.vocab_size u32 = 128256 llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004 llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... llama_model_loader: - kv 35: general.quantization_version u32 = 2 llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L... llama_model_loader: - kv 37: mradermacher.quantize_version str = 2 llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00 llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1 llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam... 
llama_model_loader: - kv 42: mradermacher.convert_type str = hf llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1... llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3 llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560 llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314 llama_model_loader: - type f32: 162 tensors llama_model_loader: - type q5_K: 481 tensors llama_model_loader: - type q6_K: 81 tensors llm_load_vocab: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG llm_load_vocab: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG llm_load_vocab: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG llm_load_vocab: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG llm_load_vocab: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG llm_load_vocab: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG llm_load_vocab: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG llm_load_vocab: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG llm_load_vocab: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG llm_load_vocab: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG llm_load_vocab: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG llm_load_vocab: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG llm_load_vocab: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG llm_load_vocab: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG llm_load_vocab: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG llm_load_vocab: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG llm_load_vocab: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG llm_load_vocab: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG llm_load_vocab: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG llm_load_vocab: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG llm_load_vocab: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG llm_load_vocab: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG llm_load_vocab: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG llm_load_vocab: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG llm_load_vocab: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG llm_load_vocab: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG llm_load_vocab: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG llm_load_vocab: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG llm_load_vocab: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG llm_load_vocab: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG llm_load_vocab: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG llm_load_vocab: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG llm_load_vocab: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG 
llm_load_vocab: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG llm_load_vocab: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG llm_load_vocab: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG llm_load_vocab: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG llm_load_vocab: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG llm_load_vocab: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG llm_load_vocab: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG llm_load_vocab: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG llm_load_vocab: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG llm_load_vocab: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG llm_load_vocab: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG llm_load_vocab: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG llm_load_vocab: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG llm_load_vocab: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG llm_load_vocab: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG llm_load_vocab: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG llm_load_vocab: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG llm_load_vocab: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG llm_load_vocab: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG llm_load_vocab: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG llm_load_vocab: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG llm_load_vocab: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG llm_load_vocab: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG llm_load_vocab: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG llm_load_vocab: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG llm_load_vocab: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG llm_load_vocab: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG llm_load_vocab: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG llm_load_vocab: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG llm_load_vocab: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG llm_load_vocab: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG llm_load_vocab: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG llm_load_vocab: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG llm_load_vocab: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG llm_load_vocab: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG llm_load_vocab: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG llm_load_vocab: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG llm_load_vocab: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG llm_load_vocab: control token: 128124 
'<|reserved_special_token_116|>' is not marked as EOG llm_load_vocab: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG llm_load_vocab: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG llm_load_vocab: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG llm_load_vocab: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG llm_load_vocab: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG llm_load_vocab: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG llm_load_vocab: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG llm_load_vocab: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG llm_load_vocab: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG llm_load_vocab: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG llm_load_vocab: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG llm_load_vocab: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG llm_load_vocab: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG llm_load_vocab: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG llm_load_vocab: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG llm_load_vocab: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG llm_load_vocab: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG llm_load_vocab: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG llm_load_vocab: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG llm_load_vocab: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG llm_load_vocab: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG llm_load_vocab: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG llm_load_vocab: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG llm_load_vocab: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG llm_load_vocab: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG llm_load_vocab: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG llm_load_vocab: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG llm_load_vocab: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG llm_load_vocab: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG llm_load_vocab: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG llm_load_vocab: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG llm_load_vocab: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG llm_load_vocab: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG llm_load_vocab: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG llm_load_vocab: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG llm_load_vocab: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG llm_load_vocab: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG llm_load_vocab: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG llm_load_vocab: control token: 
128044 '<|reserved_special_token_36|>' is not marked as EOG llm_load_vocab: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG llm_load_vocab: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG llm_load_vocab: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG llm_load_vocab: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG llm_load_vocab: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG llm_load_vocab: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG llm_load_vocab: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG llm_load_vocab: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG llm_load_vocab: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG llm_load_vocab: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG llm_load_vocab: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG llm_load_vocab: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG llm_load_vocab: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG llm_load_vocab: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG llm_load_vocab: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG llm_load_vocab: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG llm_load_vocab: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG llm_load_vocab: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG llm_load_vocab: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG llm_load_vocab: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG llm_load_vocab: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG llm_load_vocab: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG llm_load_vocab: control token: 128010 '<|python_tag|>' is not marked as EOG llm_load_vocab: control token: 128006 '<|start_header_id|>' is not marked as EOG llm_load_vocab: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG llm_load_vocab: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG llm_load_vocab: control token: 128000 '<|begin_of_text|>' is not marked as EOG llm_load_vocab: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG llm_load_vocab: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG llm_load_vocab: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG llm_load_vocab: control token: 128007 '<|end_header_id|>' is not marked as EOG llm_load_vocab: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG llm_load_vocab: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG llm_load_vocab: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG llm_load_vocab: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG llm_load_vocab: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG llm_load_vocab: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG llm_load_vocab: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG llm_load_vocab: control token: 128230 '<|reserved_special_token_222|>' is not marked as 
EOG llm_load_vocab: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG llm_load_vocab: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG llm_load_vocab: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG llm_load_vocab: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG llm_load_vocab: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG llm_load_vocab: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG llm_load_vocab: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG llm_load_vocab: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG llm_load_vocab: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG llm_load_vocab: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG llm_load_vocab: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG llm_load_vocab: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG llm_load_vocab: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG llm_load_vocab: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG llm_load_vocab: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG llm_load_vocab: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG llm_load_vocab: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG llm_load_vocab: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG llm_load_vocab: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG llm_load_vocab: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG llm_load_vocab: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG llm_load_vocab: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG llm_load_vocab: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG llm_load_vocab: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG llm_load_vocab: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG llm_load_vocab: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG llm_load_vocab: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG llm_load_vocab: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG llm_load_vocab: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG llm_load_vocab: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG llm_load_vocab: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG llm_load_vocab: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG llm_load_vocab: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG llm_load_vocab: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG llm_load_vocab: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG llm_load_vocab: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG llm_load_vocab: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG llm_load_vocab: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG llm_load_vocab: control token: 128072 '<|reserved_special_token_64|>' is 
not marked as EOG
llm_load_vocab: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
llm_load_vocab: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
[~60 more "llm_load_vocab: control token: ... is not marked as EOG" lines elided]
llm_load_vocab: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
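(As a sanity check on the attention geometry above: the head counts and KV widths are consistent GQA arithmetic. A quick sketch of the relations, my own note rather than anything from the ollama/llama.cpp code:)

```go
package main

import "fmt"

func main() {
	// Values reported by llm_load_print_meta above.
	nEmbd, nHead, nHeadKV := 8192, 64, 8

	headDim := nEmbd / nHead   // 8192/64 = 128, matches n_embd_head_k / n_embd_head_v
	nGQA := nHead / nHeadKV    // 64/8 = 8, matches n_gqa
	kvDim := headDim * nHeadKV // 128*8 = 1024, matches n_embd_k_gqa / n_embd_v_gqa

	fmt.Println(headDim, nGQA, kvDim) // 128 8 1024
}
```

So the K/V projections are 8x narrower than the embedding, which is what keeps the KV cache sizes later in this log as small as they are.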
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 152 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 65 repeating layers to GPU
llm_load_tensors: offloaded 65/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 9576.86 MiB
llm_load_tensors: CUDA0 model buffer size = 18292.94 MiB
llm_load_tensors: CUDA1 model buffer size = 19069.69 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2024-12-21T09:38:34.988-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.01"
[steady stream of near-identical "model load progress" DEBUG lines (0.01 through 0.21) elided]
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
[further "model load progress" DEBUG lines (0.22 through 0.60) elided]
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
[further "model load progress" DEBUG lines (0.60 through 0.99) elided]
time=2024-12-21T09:41:39.857-05:00 level=DEBUG source=server.go:600 msg="model load progress 1.00"
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 63.75 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 140.25 MiB
llama_new_context_with_model: KV self size = 340.00 MiB, K (q8_0): 170.00 MiB, V (q8_0): 170.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 162.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 171 (with bs=512), 4 (with bs=1)
time=2024-12-21T09:41:40.608-05:00 level=INFO source=server.go:594 msg="llama runner started in 188.13 seconds"
time=2024-12-21T09:41:40.608-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
[GIN] 2024/12/21 - 09:41:40 | 200 | 3m8s | 127.0.0.1 | POST "/api/generate"
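Those KV cache figures check out against the model geometry: 80 layers, a 1024-wide K and V per token, a 2048-token context, and q8_0 cells at 34 bytes per 32-element block come to exactly 170 MiB for each of K and V. A back-of-envelope sketch (mine, not ollama code; the 34-byte q8_0 block is the standard ggml layout of 32 int8 values plus an fp16 scale):

```go
package main

import "fmt"

func main() {
	const (
		nLayer   = 80   // llm_load_print_meta: n_layer
		kvDim    = 1024 // n_embd_k_gqa / n_embd_v_gqa
		nCtx     = 2048 // llama_new_context_with_model: n_ctx
		q80Block = 32   // q8_0 quantizes 32 elements per block...
		q80Bytes = 34   // ...into 34 bytes (32 int8 values + fp16 scale)
	)

	elems := nLayer * kvDim * nCtx // elements in the K cache (V is identical)
	bytes := elems / q80Block * q80Bytes

	fmt.Printf("K cache: %.2f MiB (same for V)\n", float64(bytes)/(1024*1024))
	// K cache: 170.00 MiB (same for V) -> matches "K (q8_0): 170.00 MiB"
}
```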
time=2024-12-21T09:41:40.608-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-12-21T09:41:40.608-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 duration=1h0m0s
time=2024-12-21T09:41:40.608-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:08.353-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="159.8 GiB" before.free_swap="236.1 GiB" now.total="191.7 GiB" now.free="149.1 GiB" now.free_swap="187.1 GiB"
time=2024-12-21T09:42:08.377-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="2.6 GiB" now.used="21.4 GiB"
time=2024-12-21T09:42:08.392-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="3.2 GiB" now.used="19.8 GiB"
releasing nvml library
time=2024-12-21T09:42:08.393-05:00 level=DEBUG source=server.go:1080 msg="stopping llama server"
time=2024-12-21T09:42:08.393-05:00 level=DEBUG source=server.go:1086 msg="waiting for llama server to exit"
time=2024-12-21T09:42:08.644-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="149.1 GiB" before.free_swap="187.1 GiB" now.total="191.7 GiB" now.free="149.1 GiB" now.free_swap="207.0 GiB"
time=2024-12-21T09:42:08.953-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="2.6 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:08.969-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="3.2 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:08.970-05:00 level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.62 seconds" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=server.go:1090 msg="llama server stopped"
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
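The "gpu VRAM free memory converged" line is the scheduler waiting for the freed VRAM readings to stop moving before it reloads with the new settings. Something like the following polling loop, sketched only to illustrate the idea (hypothetical helper names, not the real sched.go code):

```go
package main

import (
	"fmt"
	"time"
)

// waitForVRAMConvergence polls a free-VRAM reading until two consecutive
// samples agree (i.e. the driver has finished releasing memory) or a
// timeout expires. Sketch of the idea only, not ollama's implementation.
func waitForVRAMConvergence(read func() uint64, timeout time.Duration) time.Duration {
	start := time.Now()
	prev := read()
	for time.Since(start) < timeout {
		time.Sleep(250 * time.Millisecond)
		cur := read()
		if cur == prev { // readings stabilized
			break
		}
		prev = cur
	}
	return time.Since(start)
}

func main() {
	// Fake NVML readings: free VRAM (GiB) climbing as the old runner exits.
	samples := []uint64{3, 12, 22, 22, 22}
	i := 0
	read := func() uint64 {
		if i < len(samples)-1 {
			i++
		}
		return samples[i]
	}
	fmt.Println("converged after", waitForVRAMConvergence(read, 5*time.Second))
}
```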
time=2024-12-21T09:42:09.702-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="149.1 GiB" before.free_swap="207.0 GiB" now.total="191.7 GiB" now.free="159.8 GiB" now.free_swap="236.6 GiB"
time=2024-12-21T09:42:09.722-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.0 GiB" now.total="24.0 GiB" now.free="22.0 GiB" now.used="1.9 GiB"
time=2024-12-21T09:42:09.738-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-21T09:42:09.758-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090
time=2024-12-21T09:42:09.758-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
[between each of the "evaluating" lines below, an identical "updating system memory data" / "updating cuda memory data" (x2) / "releasing nvml library" polling block has been elided; the readings never change: 22.0 GiB and 22.5 GiB free]
time=2024-12-21T09:42:09.785-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.0 GiB]"
time=2024-12-21T09:42:09.819-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-21T09:42:09.848-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.0 GiB]"
time=2024-12-21T09:42:09.878-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.0 GiB]"
time=2024-12-21T09:42:09.910-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.0 GiB]"
time=2024-12-21T09:42:09.973-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="159.8 GiB" free_swap="236.6 GiB"
time=2024-12-21T09:42:09.973-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.0 GiB 22.5 GiB]"
time=2024-12-21T09:42:10.112-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=49 layers.split=24,25 memory.available="[22.0 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="61.9 GiB" memory.required.partial="41.3 GiB" memory.required.kv="5.0 GiB" memory.required.allocations="[20.3 GiB 21.0 GiB]" memory.weights.total="50.0 GiB" memory.weights.repeating="49.2 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="4.3 GiB" memory.graph.partial="4.3 GiB"
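That "offload to cuda" line is where the 49 layers and the 24,25 split come from: the full load would need 61.9 GiB against ~44.5 GiB of usable VRAM, so the scheduler offloads what fits and divides it roughly in proportion to each GPU's free memory. A toy version of that proportional split (my illustration under that assumption, not the actual memory.go estimator):

```go
package main

import "fmt"

// splitLayers divides n offloaded layers across GPUs in proportion to
// each GPU's free memory, handing any remainder to the last GPU.
// Toy illustration only -- not ollama's actual estimator.
func splitLayers(freeGiB []float64, n int) []int {
	var total float64
	for _, f := range freeGiB {
		total += f
	}
	split := make([]int, len(freeGiB))
	assigned := 0
	for i, f := range freeGiB {
		split[i] = int(float64(n) * f / total)
		assigned += split[i]
	}
	split[len(split)-1] += n - assigned // remainder
	return split
}

func main() {
	fmt.Println(splitLayers([]float64{22.0, 22.5}, 49)) // [24 25]
}
```

This is the path where both GPUs get used, which is exactly where the corrupted output shows up for me.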
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11_avx\\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe"
time=2024-12-21T09:42:10.112-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_avx\\ollama_llama_server.exe"
[the same five "availableServers : found" lines are logged a second time]
time=2024-12-21T09:42:10.112-05:00 level=INFO source=server.go:223 msg="enabling flash attention"
time=2024-12-21T09:42:10.115-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe runner --model E:\\DeepLearning\\LLM\\blobs\\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 --ctx-size 32768 --batch-size 512 --n-gpu-layers 49 --verbose --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 24,25 --port 53791"
time=2024-12-21T09:42:10.115-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_3=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;... (long PATH contents elided; it includes the CUDA v12.1 and v12.3 bin directories, miniconda, Docker, etc.)]"
time=2024-12-21T09:42:10.117-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-21T09:42:10.117-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-21T09:42:10.119-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-21T09:42:10.180-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2024-12-21T09:42:10.278-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2024-12-21T09:42:10.279-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:53791"
time=2024-12-21T09:42:10.370-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated llama_model_loader: - kv 4: general.basename str = Llama-3.3 llama_model_loader: - kv 5: general.size_label str = 70B llama_model_loader: - kv 6: general.license str = llama3.3 llama_model_loader: - kv 7: general.base_model.count u32 = 1 llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ... llama_model_loader: - kv 13: llama.block_count u32 = 80 llama_model_loader: - kv 14: llama.context_length u32 = 131072 llama_model_loader: - kv 15: llama.embedding_length u32 = 8192 llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 17: llama.attention.head_count u32 = 64 llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 21: llama.attention.key_length u32 = 128 llama_model_loader: - kv 22: llama.attention.value_length u32 = 128 llama_model_loader: - kv 23: general.file_type u32 = 17 llama_model_loader: - kv 24: llama.vocab_size u32 = 128256 llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004 llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... llama_model_loader: - kv 35: general.quantization_version u32 = 2 llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L... llama_model_loader: - kv 37: mradermacher.quantize_version str = 2 llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00 llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1 llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam... llama_model_loader: - kv 42: mradermacher.convert_type str = hf llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1... 
llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3 llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560 llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314 llama_model_loader: - type f32: 162 tensors llama_model_loader: - type q5_K: 481 tensors llama_model_loader: - type q6_K: 81 tensors llm_load_vocab: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG llm_load_vocab: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG llm_load_vocab: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG llm_load_vocab: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG llm_load_vocab: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG llm_load_vocab: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG llm_load_vocab: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG llm_load_vocab: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG llm_load_vocab: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG llm_load_vocab: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG llm_load_vocab: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG llm_load_vocab: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG llm_load_vocab: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG llm_load_vocab: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG llm_load_vocab: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG llm_load_vocab: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG llm_load_vocab: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG llm_load_vocab: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG llm_load_vocab: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG llm_load_vocab: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG llm_load_vocab: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG llm_load_vocab: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG llm_load_vocab: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG llm_load_vocab: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG llm_load_vocab: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG llm_load_vocab: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG llm_load_vocab: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG llm_load_vocab: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG llm_load_vocab: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG llm_load_vocab: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG llm_load_vocab: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG llm_load_vocab: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG llm_load_vocab: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG llm_load_vocab: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG llm_load_vocab: control token: 128194 '<|reserved_special_token_186|>' 
is not marked as EOG llm_load_vocab: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG llm_load_vocab: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG llm_load_vocab: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG llm_load_vocab: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG llm_load_vocab: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG llm_load_vocab: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG llm_load_vocab: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG llm_load_vocab: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG llm_load_vocab: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG llm_load_vocab: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG llm_load_vocab: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG llm_load_vocab: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG llm_load_vocab: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG llm_load_vocab: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG llm_load_vocab: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG llm_load_vocab: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG llm_load_vocab: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG llm_load_vocab: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG llm_load_vocab: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG llm_load_vocab: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG llm_load_vocab: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG llm_load_vocab: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG llm_load_vocab: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG llm_load_vocab: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG llm_load_vocab: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG llm_load_vocab: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG llm_load_vocab: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG llm_load_vocab: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG llm_load_vocab: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG llm_load_vocab: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG llm_load_vocab: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG llm_load_vocab: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG llm_load_vocab: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG llm_load_vocab: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG llm_load_vocab: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG llm_load_vocab: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG llm_load_vocab: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG llm_load_vocab: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG llm_load_vocab: control token: 128122 
'<|reserved_special_token_114|>' is not marked as EOG llm_load_vocab: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG llm_load_vocab: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG llm_load_vocab: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG llm_load_vocab: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG llm_load_vocab: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG llm_load_vocab: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG llm_load_vocab: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG llm_load_vocab: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG llm_load_vocab: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG llm_load_vocab: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG llm_load_vocab: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG llm_load_vocab: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG llm_load_vocab: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG llm_load_vocab: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG llm_load_vocab: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG llm_load_vocab: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG llm_load_vocab: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG llm_load_vocab: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG llm_load_vocab: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG llm_load_vocab: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG llm_load_vocab: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG llm_load_vocab: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG llm_load_vocab: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG llm_load_vocab: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG llm_load_vocab: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG llm_load_vocab: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG llm_load_vocab: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG llm_load_vocab: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG llm_load_vocab: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG llm_load_vocab: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG llm_load_vocab: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG llm_load_vocab: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG llm_load_vocab: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG llm_load_vocab: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG llm_load_vocab: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG llm_load_vocab: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG llm_load_vocab: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG llm_load_vocab: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG llm_load_vocab: control token: 
128042 '<|reserved_special_token_34|>' is not marked as EOG llm_load_vocab: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG llm_load_vocab: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG llm_load_vocab: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG llm_load_vocab: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG llm_load_vocab: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG llm_load_vocab: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG llm_load_vocab: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG llm_load_vocab: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG llm_load_vocab: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG llm_load_vocab: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG llm_load_vocab: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG llm_load_vocab: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG llm_load_vocab: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG llm_load_vocab: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG llm_load_vocab: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG llm_load_vocab: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG llm_load_vocab: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG llm_load_vocab: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG llm_load_vocab: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG llm_load_vocab: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG llm_load_vocab: control token: 128010 '<|python_tag|>' is not marked as EOG llm_load_vocab: control token: 128006 '<|start_header_id|>' is not marked as EOG llm_load_vocab: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG llm_load_vocab: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG llm_load_vocab: control token: 128000 '<|begin_of_text|>' is not marked as EOG llm_load_vocab: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG llm_load_vocab: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG llm_load_vocab: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG llm_load_vocab: control token: 128007 '<|end_header_id|>' is not marked as EOG llm_load_vocab: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG llm_load_vocab: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG llm_load_vocab: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG llm_load_vocab: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG llm_load_vocab: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG llm_load_vocab: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG llm_load_vocab: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG llm_load_vocab: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG llm_load_vocab: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG llm_load_vocab: control token: 128153 '<|reserved_special_token_145|>' is not marked as 
EOG llm_load_vocab: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG llm_load_vocab: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG llm_load_vocab: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG llm_load_vocab: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG llm_load_vocab: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG llm_load_vocab: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG llm_load_vocab: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG llm_load_vocab: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG llm_load_vocab: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG llm_load_vocab: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG llm_load_vocab: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG llm_load_vocab: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG llm_load_vocab: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG llm_load_vocab: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG llm_load_vocab: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG llm_load_vocab: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG llm_load_vocab: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG llm_load_vocab: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG llm_load_vocab: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG llm_load_vocab: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG llm_load_vocab: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG llm_load_vocab: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG llm_load_vocab: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG llm_load_vocab: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG llm_load_vocab: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG llm_load_vocab: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG llm_load_vocab: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG llm_load_vocab: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG llm_load_vocab: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG llm_load_vocab: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG llm_load_vocab: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG llm_load_vocab: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG llm_load_vocab: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG llm_load_vocab: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG llm_load_vocab: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG llm_load_vocab: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG llm_load_vocab: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG llm_load_vocab: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG llm_load_vocab: control token: 128186 '<|reserved_special_token_178|>' is 
not marked as EOG llm_load_vocab: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG llm_load_vocab: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG llm_load_vocab: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG llm_load_vocab: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG llm_load_vocab: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG llm_load_vocab: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG llm_load_vocab: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG llm_load_vocab: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG llm_load_vocab: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG llm_load_vocab: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG llm_load_vocab: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG llm_load_vocab: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG llm_load_vocab: control token: 128001 '<|end_of_text|>' is not marked as EOG llm_load_vocab: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG llm_load_vocab: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG llm_load_vocab: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG llm_load_vocab: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG llm_load_vocab: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG llm_load_vocab: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG llm_load_vocab: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG llm_load_vocab: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG llm_load_vocab: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG llm_load_vocab: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG llm_load_vocab: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG llm_load_vocab: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG llm_load_vocab: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG llm_load_vocab: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG llm_load_vocab: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG llm_load_vocab: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG llm_load_vocab: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG llm_load_vocab: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG llm_load_vocab: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG llm_load_vocab: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG llm_load_vocab: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG llm_load_vocab: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG llm_load_vocab: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG llm_load_vocab: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG llm_load_vocab: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG llm_load_vocab: control token: 128040 '<|reserved_special_token_32|>' 
is not marked as EOG llm_load_vocab: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG llm_load_vocab: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG llm_load_vocab: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG llm_load_vocab: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG llm_load_vocab: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG llm_load_vocab: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG llm_load_vocab: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG llm_load_vocab: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG llm_load_vocab: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG llm_load_vocab: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG llm_load_vocab: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG llm_load_vocab: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG llm_load_vocab: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG llm_load_vocab: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG llm_load_vocab: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG llm_load_vocab: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG llm_load_vocab: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG llm_load_vocab: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG llm_load_vocab: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG llm_load_vocab: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG llm_load_vocab: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG llm_load_vocab: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG llm_load_vocab: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG llm_load_vocab: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 8192 llm_load_print_meta: n_layer = 80 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 28672 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: 
freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 70B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 70.55 B llm_load_print_meta: model size = 46.51 GiB (5.66 BPW) llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOM token = 128008 '<|eom_id|>' llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOG token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 312 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead llm_load_tensors: offloading 49 repeating layers to GPU llm_load_tensors: offloaded 49/81 layers to GPU llm_load_tensors: CPU model buffer size = 688.88 MiB llm_load_tensors: CUDA_Host model buffer size = 18738.73 MiB llm_load_tensors: CUDA0 model buffer size = 13712.00 MiB llm_load_tensors: CUDA1 model buffer size = 14488.75 MiB load_all_data: no device found for buffer type CPU for async uploads load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads time=2024-12-21T09:42:14.878-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.04" time=2024-12-21T09:42:15.128-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.08" time=2024-12-21T09:42:15.378-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.11" time=2024-12-21T09:42:15.629-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.15" time=2024-12-21T09:42:15.879-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.18" time=2024-12-21T09:42:16.129-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.22" time=2024-12-21T09:42:16.380-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.25" time=2024-12-21T09:42:16.630-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.29" time=2024-12-21T09:42:16.880-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.32" time=2024-12-21T09:42:17.131-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.36" time=2024-12-21T09:42:17.381-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.39" load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0 time=2024-12-21T09:42:17.632-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.42" time=2024-12-21T09:42:17.882-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.45" time=2024-12-21T09:42:18.133-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.48" time=2024-12-21T09:42:18.384-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.51" time=2024-12-21T09:42:18.634-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.54" time=2024-12-21T09:42:18.884-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.57" time=2024-12-21T09:42:19.135-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.60" 
time=2024-12-21T09:42:19.385-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.63" time=2024-12-21T09:42:19.636-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.65" time=2024-12-21T09:42:19.886-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.68" load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1 time=2024-12-21T09:42:20.136-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.70" time=2024-12-21T09:42:20.387-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.72" time=2024-12-21T09:42:20.638-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.75" time=2024-12-21T09:42:20.888-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.78" time=2024-12-21T09:42:21.139-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.80" time=2024-12-21T09:42:21.389-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.83" time=2024-12-21T09:42:21.639-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.85" time=2024-12-21T09:42:21.890-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.88" time=2024-12-21T09:42:22.140-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.90" time=2024-12-21T09:42:22.391-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.93" time=2024-12-21T09:42:22.641-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.95" time=2024-12-21T09:42:22.891-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.98" llama_new_context_with_model: n_seq_max = 1 llama_new_context_with_model: n_ctx = 32768 llama_new_context_with_model: n_ctx_per_seq = 32768 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized time=2024-12-21T09:42:23.142-05:00 level=DEBUG source=server.go:600 msg="model load progress 1.00" time=2024-12-21T09:42:23.392-05:00 level=DEBUG source=server.go:603 msg="model load completed, waiting for server to become available" status="llm server loading model" llama_kv_cache_init: CPU KV buffer size = 2108.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 1632.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 1700.00 MiB llama_new_context_with_model: KV self size = 5440.00 MiB, K (q8_0): 2720.00 MiB, V (q8_0): 2720.00 MiB llama_new_context_with_model: CPU output buffer size = 0.52 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB llama_new_context_with_model: CUDA1 compute buffer size = 176.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 80.01 MiB llama_new_context_with_model: graph nodes = 2247 llama_new_context_with_model: graph splits = 347 (with bs=512), 4 (with bs=1) time=2024-12-21T09:42:23.642-05:00 level=INFO source=server.go:594 msg="llama runner started in 13.53 seconds" time=2024-12-21T09:42:23.642-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 time=2024-12-21T09:42:23.642-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail. 
Expand each sentence into conversation as applicable. Here I means Rob. Tell the story in Rob's perspective.\nBackground:\nScene: In the kitchen, Lizzy, Melody, and Emily were cooking quesadillas and burritos for 12+ people. Florence and her elder sister Francesca came and helped cook chimichangas and (add another popular mexican food). Mom was there chatting with Aunt Maureen and Aunt Julie. Mayor Layla Williams and her wife Kristen came to our house bringing some homemade salsa and chips. Mom thanked them. They chatted about how Layla and her wife would like to conceive, but they still hadn't found a suitable donor yet. Mom said that the time will come. Aunt Julie asked them about considering adoption, but Layla said that would be the last resort. Maureen asked if they had fertility checks yet and Kristen said they did and the results were great. Add several conversation about how Mom, Aunt Maureen and Aunt Julie got laid off last week and how they tried to find new jobs. They managed to get some clients in several hotels, but nothing much. Layla would help find some clients for them and text them soon so that they could earn some commission. Kristen commented how the food were so good. Melody and Francesca offered them some burritos and quesadillas to pack home. Layla and Kristen gladly accepted them. They tried a bite and they praised how delicious the food are. These were the hard work of the five girls. Layla and her wife Kristen said goodbye. With that, the food were ready. The five girls and I helped serve the dishes. No dinner yet.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" time=2024-12-21T09:42:23.643-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=353 used=0 remaining=353 time=2024-12-21T09:42:51.396-05:00 level=DEBUG source=sched.go:466 msg="context for request finished" time=2024-12-21T09:42:51.396-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 duration=1h0m0s time=2024-12-21T09:42:51.396-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 refCount=0 [GIN] 2024/12/21 - 09:42:51 | 200 | 43.0650887s | 127.0.0.1 | POST "/api/chat"

@YonTracks commented on GitHub (Dec 21, 2024):

Actually, sorry, I just spotted it and remembered: I think this is the same issue as #7984.
Hopefully someone will know.

@robbyjo commented on GitHub (Dec 21, 2024):

It may not be the same, since I set the context size properly. Going up to 128K context doesn't make any difference. I've got a beefy PC (192GB RAM and 2x4090 with 24GB VRAM each). If I only fill one graphics card (or simply set CUDA_VISIBLE_DEVICES=1 instead of 0,1), then the whole thing works, including 128K context. However, my desire is to use BOTH of my GPUs, not just one. And this is on Windows 11, by the way. I heard that this was no problem on Linux.
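
For reference, the single-GPU workaround amounts to something like this (cmd.exe; a minimal sketch assuming the server is then started from the same session so it inherits the variable, and the device index may differ per machine):

REM limit the runner to the second GPU only, then start the server
set CUDA_VISIBLE_DEVICES=1
ollama serve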

@YonTracks commented on GitHub (Dec 21, 2024):

> It may not be the same, since I set the context size properly. Going up to 128K context doesn't make any difference. I've got a beefy PC (192GB RAM and 2x4090 with 24GB VRAM each). If I only fill one graphics card (or simply set CUDA_VISIBLE_DEVICES=1 instead of 0,1), then the whole thing works, including 128K context. However, my desire is to use BOTH of my GPUs, not just one. And this is on Windows 11, by the way. I heard that this was no problem on Linux.

Ah yes, good info, cheers.
Yes, I see the two changes with both GPUs:
llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
and
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Why? Someone will know. I will keep investigating anyway; if I find anything I will share the info.
Good luck.

@YonTracks commented on GitHub (Dec 21, 2024):

> It may not be the same, since I set the context size properly. Going up to 128K context doesn't make any difference. I've got a beefy PC (192GB RAM and 2x4090 with 24GB VRAM each). If I only fill one graphics card (or simply set CUDA_VISIBLE_DEVICES=1 instead of 0,1), then the whole thing works, including 128K context. However, my desire is to use BOTH of my GPUs, not just one. And this is on Windows 11, by the way. I heard that this was no problem on Linux.

You say it's only on Windows? Good info, cheers.
Would it be here, in llm/server.go, in the params for the other model? Something is happening, lol, but it's late, so that's all I've got for now:

if pathNeeded {
			s.cmd.Env = append(s.cmd.Env, pathEnv+"="+pathEnvVal)
		}
		if devicesNeeded {
			s.cmd.Env = append(s.cmd.Env, visibleDevicesEnv+"="+visibleDevicesEnvVal)
		}

		slog.Info("starting llama server", "cmd", s.cmd.String())
		if envconfig.Debug() {
			filteredEnv := []string{}
			for _, ev := range s.cmd.Env {
				if strings.HasPrefix(ev, "CUDA_") ||
					strings.HasPrefix(ev, "ROCR_") ||
					strings.HasPrefix(ev, "ROCM_") ||
					strings.HasPrefix(ev, "HIP_") ||
					strings.HasPrefix(ev, "GPU_") ||
					strings.HasPrefix(ev, "HSA_") ||
					strings.HasPrefix(ev, "GGML_") ||
					strings.HasPrefix(ev, "PATH=") ||
					strings.HasPrefix(ev, "LD_LIBRARY_PATH=") {
					filteredEnv = append(filteredEnv, ev)
				}
			}
			// Log at debug as the environment is inherited and might contain sensitive information
			slog.Debug("subprocess", "environment", filteredEnv)
		}

Actually, I think we nailed it? How are you setting the num_ctx? I'm not sure about that:

> I did set the num_ctx to 32K before the query. I used Open WebUI for it. Now I used the command line:
> /set parameter num_ctx 32768
> Set parameter 'num_ctx' to '32768'

Will it persist to both models?
Try options? And other ways, maybe Open WebUI. But good progress.
Cheers, good luck.

@robbyjo commented on GitHub (Dec 21, 2024):

Thanks. Not 100% sure about the Windows thing. That was only my impression.

> Will it persist to both models?

Not sure what you mean by this statement. This issue happens with all models. I also tried this with the ollama command line and the same thing happened. Are you saying that num_ctx was set on one GPU but not the others? That'd be strange.

I am not sure how to debug Go; this would be my first exposure. I can handle some other languages (C/C++, Python, Java, or R).

@YonTracks commented on GitHub (Dec 21, 2024):

> Thanks. Not 100% sure about the Windows thing. That was only my impression.
>
> > Will it persist to both models?
>
> Are you saying that num_ctx was set on one GPU but not the others? That'd be strange.

Yep, that's what seems to be happening; this needs to be checked:

if devicesNeeded {
			s.cmd.Env = append(s.cmd.Env, visibleDevicesEnv+"="+visibleDevicesEnvVal)
		}

@robbyjo commented on GitHub (Dec 21, 2024):

OK, I'd be happy to try out a test build if a precompiled binary for Windows is available.

@rick-github commented on GitHub (Dec 22, 2024):

num_ctx is a per-model setting, not per-GPU. Have you tested with older versions of ollama? For example, 0.3.14 uses C++ runners rather than Go.
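
One way to rule out client-side /set state entirely is to pass num_ctx per request in the API options field, e.g. (a sketch; the prompt is arbitrary and the model name is the one from this thread):

REM num_ctx travels with the request, independent of any /set or Open WebUI state
curl http://127.0.0.1:11434/api/generate -d "{\"model\": \"hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M\", \"prompt\": \"Why is the sky blue?\", \"stream\": false, \"options\": {\"num_ctx\": 32768}}"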

@YonTracks commented on GitHub (Dec 22, 2024):

> num_ctx is a per-model setting, not per-GPU. Have you tested with older versions of ollama? For example, 0.3.14 uses C++ runners rather than Go.

I bet it worked, then, for exactly that reason: "uses C++ runners rather than Go".
I'm seeing an issue with the env params being passed to the other GPUs; the other GPUs seem to be using the defaults, not the set params (and it's a Windows thing for sure, lol). I can't test multi-GPU; if you can, then check that. Try hard-coding the GPU list and params or something.
llm/server.go: line 173:

	params := []string{
		"--model", model,
		"--ctx-size", strconv.Itoa(32768),
		"--batch-size", strconv.Itoa(opts.NumBatch),
	}

This should hard-code the num_ctx.
For me:

time=2024-12-22T12:30:16.595+10:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\\Users\\clint\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe runner --model C:\\Users\\clint\\.ollama\\models\\blobs\\sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 32768 --batch-size 512 --n-gpu-layers 33 --verbose --threads 6 --flash-attn --kv-cache-type f16 --no-mmap --parallel 1 --port 63345"

With this change, build/make in dev mode, go build ., etc.
Then for quick testing I copy the new ollama.exe over "\AppData\Local\Programs\Ollama\ollama.exe".
To keep the original, I rename the old ollama.exe to .txt so I can change it back, lol.
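
In concrete terms, the quick test loop is roughly this (a sketch, assuming a Go toolchain inside an ollama checkout; %LOCALAPPDATA% expands to C:\Users\<you>\AppData\Local):

REM rebuild just the Go binary and drop it over the installed one (back up the original first)
go build .
copy /Y ollama.exe "%LOCALAPPDATA%\Programs\Ollama\ollama.exe"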
good luck

Hope that's OK; better than me compiling an OllamaSetup.exe? I could do that, but it seems scary and not safe practice (only trust one that's from ollama, and I am not ollama, lol).
But: 0.5.4-yontracks, for the hardcoded "--ctx-size", strconv.Itoa(32768) and the correct OllamaSetup.exe wizard size.
I'll share the link when the build completes.
Actually, I can't do that; the file size is too big, I tried.
Better to build and test yourself anyway.
Good luck.

@robbyjo commented on GitHub (Dec 22, 2024):

Thank you for the insight. I tried version 0.3.14 and set the context (num_ctx) to 131072 and IT WORKED!!!!! THANK YOU SO MUCH!!!
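
(For anyone wanting to reproduce the downgrade: pinned Windows installers can be pulled from the GitHub releases page. The URL pattern below is assumed from other Ollama releases, so verify it against the releases page first.)

REM assumed release-asset URL pattern; check https://github.com/ollama/ollama/releases before running
curl -L -o OllamaSetup.exe https://github.com/ollama/ollama/releases/download/v0.3.14/OllamaSetup.exe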

I tested the following model:

ollama.exe run hf.co/ArliAI/Llama-3.1-70B-ArliAI-RPMax-v1.3-GGUF:Q4_K_M

Server log:

2024/12/22 14:35:19 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1572864000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\DeepLearning\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-12-22T14:35:19.128-05:00 level=INFO source=images.go:754 msg="total blobs: 74"
time=2024-12-22T14:35:19.130-05:00 level=INFO source=images.go:761 msg="total unused blobs removed: 0"
time=2024-12-22T14:35:19.131-05:00 level=INFO source=routes.go:1205 msg="Listening on 127.0.0.1:11434 (version 0.3.14)"
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe"
time=2024-12-22T14:35:19.131-05:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v6.1]"
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:50 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-12-22T14:35:19.131-05:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-12-22T14:35:19.131-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-12-22T14:35:19.131-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2024-12-22T14:35:19.131-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=gpu.go:94 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=gpu.go:505 msg="Searching for GPU library" name=nvml.dll
time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=gpu.go:528 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvml.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvml.dll C:\Windows\system32\nvml.dll C:\Windows\nvml.dll C:\Windows\System32\Wbem\nvml.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvml.dll C:\Windows\System32\OpenSSH\nvml.dll C:\Program Files\dotnet\nvml.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvml.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvml.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\nvml.dll C:\Program Files (x86)\Incredibuild\nvml.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvml.dll C:\Program Files\nodejs\nvml.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvml.dll C:\Program Files\Git\cmd\nvml.dll C:\Program Files\PuTTY\nvml.dll C:\Program Files\Docker\Docker\resources\bin\nvml.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvml.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvml.dll C:\Users\User\.dotnet\tools\nvml.dll C:\Users\User\miniconda3\nvml.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvml.dll C:\Users\User\miniconda3\Library\usr\bin\nvml.dll C:\Users\User\miniconda3\Library\bin\nvml.dll C:\Users\User\miniconda3\Scripts\nvml.dll C:\Users\User\AppData\Roaming\npm\nvml.dll C:\Program Files\7-Zip\nvml.dll C:\ffmpeg\bin\nvml.dll C:\Windows\System32\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\.cache\lm-studio\bin\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-22T14:35:19.132-05:00 level=DEBUG source=gpu.go:533 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll"
time=2024-12-22T14:35:19.132-05:00 level=DEBUG source=gpu.go:562 msg="discovered GPU libraries" paths="[C:\Windows\system32\nvml.dll C:\Windows\System32\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-22T14:35:19.149-05:00 level=DEBUG source=gpu.go:115 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2024-12-22T14:35:19.149-05:00 level=DEBUG source=gpu.go:505 msg="Searching for GPU library" name=nvcuda.dll
time=2024-12-22T14:35:19.149-05:00 level=DEBUG source=gpu.go:528 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvcuda.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvcuda.dll C:\Windows\system32\nvcuda.dll C:\Windows\nvcuda.dll C:\Windows\System32\Wbem\nvcuda.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvcuda.dll C:\Windows\System32\OpenSSH\nvcuda.dll C:\Program Files\dotnet\nvcuda.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvcuda.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvcuda.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\nvcuda.dll C:\Program Files (x86)\Incredibuild\nvcuda.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvcuda.dll C:\Program Files\nodejs\nvcuda.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvcuda.dll C:\Program Files\Git\cmd\nvcuda.dll C:\Program Files\PuTTY\nvcuda.dll C:\Program Files\Docker\Docker\resources\bin\nvcuda.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvcuda.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvcuda.dll C:\Users\User\.dotnet\tools\nvcuda.dll C:\Users\User\miniconda3\nvcuda.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvcuda.dll C:\Users\User\miniconda3\Library\usr\bin\nvcuda.dll C:\Users\User\miniconda3\Library\bin\nvcuda.dll C:\Users\User\miniconda3\Scripts\nvcuda.dll C:\Users\User\AppData\Roaming\npm\nvcuda.dll C:\Program Files\7-Zip\nvcuda.dll C:\ffmpeg\bin\nvcuda.dll C:\Windows\System32\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\.cache\lm-studio\bin\nvcuda.dll c:\windows\system\nvcuda.dll]"
time=2024-12-22T14:35:19.149-05:00 level=DEBUG source=gpu.go:533 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll"
time=2024-12-22T14:35:19.150-05:00 level=DEBUG source=gpu.go:562 msg="discovered GPU libraries" paths="[C:\Windows\system32\nvcuda.dll C:\Windows\System32\nvcuda.dll]"
time=2024-12-22T14:35:19.177-05:00 level=DEBUG source=gpu.go:129 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
time=2024-12-22T14:35:19.529-05:00 level=INFO source=gpu.go:326 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2024-12-22T14:35:19.529-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
time=2024-12-22T14:35:19.531-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2024-12-22T14:35:19.531-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/12/22 - 14:35:45 | 200 | 0s | 127.0.0.1 | GET "/api/version"
[GIN] 2024/12/22 - 14:35:53 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/22 - 14:35:53 | 200 | 10.3943ms | 127.0.0.1 | POST "/api/show"
time=2024-12-22T14:35:53.955-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="155.9 GiB" before.free_swap="231.3 GiB" now.total="191.7 GiB" now.free="155.6 GiB" now.free_swap="229.7 GiB"
time=2024-12-22T14:35:53.973-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T14:35:53.989-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
time=2024-12-22T14:35:53.989-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0xd7cca0 gpu_count=2
time=2024-12-22T14:35:54.009-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:35:54.009-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T14:35:54.009-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T14:35:54.009-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T14:35:54.010-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T14:35:54.011-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T14:35:54.011-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T14:35:54.012-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="155.6 GiB" before.free_swap="229.7 GiB" now.total="191.7 GiB" now.free="155.6 GiB" now.free_swap="229.7 GiB"
time=2024-12-22T14:35:54.035-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T14:35:54.050-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
time=2024-12-22T14:35:54.051-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="155.6 GiB" free_swap="229.7 GiB"
time=2024-12-22T14:35:54.051-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]"
time=2024-12-22T14:35:54.051-05:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=77 layers.split=38,39 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="43.7 GiB" memory.required.partial="41.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[20.4 GiB 21.0 GiB]" memory.weights.total="38.9 GiB" memory.weights.repeating="38.1 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe"
time=2024-12-22T14:35:54.056-05:00 level=INFO source=server.go:388 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe --model E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 2048 --batch-size 512 --embedding --n-gpu-layers 77 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 38,39 --port 58940"
time=2024-12-22T14:35:54.057-05:00 level=DEBUG source=server.go:405 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12;C:\Users\User\AppData\Local\Programs\Ollama;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin ]"
time=2024-12-22T14:35:54.112-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-22T14:35:54.112-05:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
time=2024-12-22T14:35:54.113-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
INFO [wmain] starting c++ runner | tid="31080" timestamp=1734896154
INFO [wmain] build info | build=3871 commit="63424972" tid="31080" timestamp=1734896154
INFO [wmain] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="31080" timestamp=1734896154 total_threads=32
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="58940" tid="31080" timestamp=1734896154
llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3
llama_model_loader: - kv 3: general.version str = v1.3
llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
time=2024-12-22T14:35:54.373-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 1.02 MiB
llm_load_tensors: offloading 77 repeating layers to GPU
llm_load_tensors: offloaded 77/81 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 2942.24 MiB
llm_load_tensors: CUDA0 buffer size = 18482.20 MiB
llm_load_tensors: CUDA1 buffer size = 19118.70 MiB
time=2024-12-22T14:35:55.939-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.01"
time=2024-12-22T14:35:56.222-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.06"
time=2024-12-22T14:35:56.500-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.09"
time=2024-12-22T14:35:56.752-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.12"
time=2024-12-22T14:35:57.004-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.15"
time=2024-12-22T14:35:57.284-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.18"
time=2024-12-22T14:35:57.534-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.21"
time=2024-12-22T14:35:57.785-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.24"
time=2024-12-22T14:35:58.064-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.27"
time=2024-12-22T14:35:58.315-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.30"
time=2024-12-22T14:35:58.594-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.33"
time=2024-12-22T14:35:58.875-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.36"
time=2024-12-22T14:35:59.155-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.40"
time=2024-12-22T14:35:59.406-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.43"
time=2024-12-22T14:35:59.657-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.46"
time=2024-12-22T14:35:59.937-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.49"
time=2024-12-22T14:36:00.187-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.52"
time=2024-12-22T14:36:00.468-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.55"
time=2024-12-22T14:36:00.750-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.58"
time=2024-12-22T14:36:02.869-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.59"
time=2024-12-22T14:36:04.223-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.60"
time=2024-12-22T14:36:06.397-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.61"
time=2024-12-22T14:36:07.724-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.62"
time=2024-12-22T14:36:09.362-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.63"
time=2024-12-22T14:36:11.245-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.64"
time=2024-12-22T14:36:12.897-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.65"
time=2024-12-22T14:36:14.538-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.66"
time=2024-12-22T14:36:16.675-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.67"
time=2024-12-22T14:36:18.000-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.68"
time=2024-12-22T14:36:19.896-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.69"
time=2024-12-22T14:36:21.517-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.70"
time=2024-12-22T14:36:23.120-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.71"
time=2024-12-22T14:36:25.035-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.72"
time=2024-12-22T14:36:26.930-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.73"
time=2024-12-22T14:36:28.286-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.74"
time=2024-12-22T14:36:30.128-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.75"
time=2024-12-22T14:36:32.294-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.76"
time=2024-12-22T14:36:33.397-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.77"
time=2024-12-22T14:36:35.506-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.78"
time=2024-12-22T14:36:36.874-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.79"
time=2024-12-22T14:36:39.084-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.80"
time=2024-12-22T14:36:40.746-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.81"
time=2024-12-22T14:36:42.690-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.82"
time=2024-12-22T14:36:44.041-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.83"
time=2024-12-22T14:36:45.884-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.84"
time=2024-12-22T14:36:47.826-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.85"
time=2024-12-22T14:36:49.210-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.86"
time=2024-12-22T14:36:51.371-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.87"
time=2024-12-22T14:36:52.757-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.88"
time=2024-12-22T14:36:54.420-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.89"
time=2024-12-22T14:36:56.318-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.90"
time=2024-12-22T14:36:57.955-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.91"
time=2024-12-22T14:36:59.591-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.92"
time=2024-12-22T14:37:01.213-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.93"
time=2024-12-22T14:37:02.849-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.94"
time=2024-12-22T14:37:04.486-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.95"
time=2024-12-22T14:37:06.913-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.96"
time=2024-12-22T14:37:08.267-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.97"
time=2024-12-22T14:37:09.872-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.98"
time=2024-12-22T14:37:11.804-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.99"
time=2024-12-22T14:37:13.440-05:00 level=DEBUG source=server.go:632 msg="model load progress 1.00"
time=2024-12-22T14:37:13.719-05:00 level=DEBUG source=server.go:635 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 24.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 304.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 312.00 MiB
llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 162.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 44
DEBUG [initialize] initializing slots | n_slots=1 tid="31080" timestamp=1734896234
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="31080" timestamp=1734896234
INFO [wmain] model loaded | tid="31080" timestamp=1734896234
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="31080" timestamp=1734896234
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=0 tid="31080" timestamp=1734896234
time=2024-12-22T14:37:14.780-05:00 level=INFO source=server.go:626 msg="llama runner started in 80.67 seconds"
time=2024-12-22T14:37:14.780-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
[GIN] 2024/12/22 - 14:37:14 | 200 | 1m20s | 127.0.0.1 | POST "/api/generate"
time=2024-12-22T14:37:14.780-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-12-22T14:37:14.780-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 duration=1h0m0s
time=2024-12-22T14:37:14.780-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0
time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0
time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="155.6 GiB" before.free_swap="229.7 GiB" now.total="191.7 GiB" now.free="152.7 GiB" now.free_swap="186.8 GiB"
time=2024-12-22T14:37:43.158-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="2.3 GiB" now.used="21.7 GiB"
time=2024-12-22T14:37:43.174-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="2.9 GiB" now.used="20.0 GiB"
time=2024-12-22T14:37:43.175-05:00 level=DEBUG source=server.go:1086 msg="stopping llama server"
time=2024-12-22T14:37:43.175-05:00 level=DEBUG source=server.go:1092 msg="waiting for llama server to exit"
time=2024-12-22T14:37:43.439-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="152.7 GiB" before.free_swap="186.8 GiB" now.total="191.7 GiB" now.free="152.8 GiB" now.free_swap="226.0 GiB"
time=2024-12-22T14:37:43.501-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="2.3 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T14:37:43.517-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="2.9 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
time=2024-12-22T14:37:43.517-05:00 level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.38 seconds" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=server.go:1096 msg="llama server stopped"
time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="152.8 GiB" before.free_swap="226.0 GiB" now.total="191.7 GiB" now.free="156.1 GiB" now.free_swap="229.7 GiB"
time=2024-12-22T14:37:43.734-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T14:37:43.749-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB"
time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.1 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB"
time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T14:37:43.771-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="67.1 GiB"
time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.1 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="67.1 GiB"
time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T14:37:43.773-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="156.1 GiB" before.free_swap="229.7 GiB" now.total="191.7 GiB" now.free="156.1 GiB" now.free_swap="229.7 GiB"
time=2024-12-22T14:37:43.795-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T14:37:43.811-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
time=2024-12-22T14:37:43.812-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="156.1 GiB" free_swap="229.7 GiB"
time=2024-12-22T14:37:43.812-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]"
time=2024-12-22T14:37:43.813-05:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=4 layers.split=2,2 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="117.9 GiB" memory.required.partial="40.6 GiB" memory.required.kv="40.0 GiB" memory.required.allocations="[20.3 GiB 20.3 GiB]" memory.weights.total="78.2 GiB" memory.weights.repeating="77.4 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="16.8 GiB" memory.graph.partial="16.8 GiB"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe"
time=2024-12-22T14:37:43.816-05:00 level=INFO source=server.go:388 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe --model E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 131072 --batch-size 512 --embedding --n-gpu-layers 4 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 2,2 --port 59404"
time=2024-12-22T14:37:43.816-05:00 level=DEBUG source=server.go:405 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12;C:\Users\User\AppData\Local\Programs\Ollama;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.1.0\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin ]"
time=2024-12-22T14:37:43.820-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-22T14:37:43.820-05:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
time=2024-12-22T14:37:43.820-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
INFO [wmain] starting c++ runner | tid="35996" timestamp=1734896263
INFO [wmain] build info | build=3871 commit="63424972" tid="35996" timestamp=1734896263
INFO [wmain] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="35996" timestamp=1734896263 total_threads=32
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="59404" tid="35996" timestamp=1734896263
llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3
llama_model_loader: - kv 3: general.version str = v1.3
llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
time=2024-12-22T14:37:44.075-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 1.02 MiB
llm_load_tensors: offloading 4 repeating layers to GPU
llm_load_tensors: offloaded 4/81 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 38467.63 MiB
llm_load_tensors: CUDA0 buffer size = 1037.75 MiB
llm_load_tensors: CUDA1 buffer size = 1037.75 MiB
time=2024-12-22T14:37:57.983-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.04"
time=2024-12-22T14:37:58.263-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.08"
time=2024-12-22T14:37:58.541-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.12"
time=2024-12-22T14:37:58.807-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.16"
time=2024-12-22T14:37:59.086-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.20"
time=2024-12-22T14:37:59.365-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.24"
time=2024-12-22T14:37:59.645-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.28"
time=2024-12-22T14:37:59.926-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.32"
time=2024-12-22T14:38:00.192-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.36"
time=2024-12-22T14:38:00.472-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.40"
time=2024-12-22T14:38:00.752-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.44"
time=2024-12-22T14:38:01.032-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.48"
time=2024-12-22T14:38:01.283-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.52"
time=2024-12-22T14:38:01.533-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.55"
time=2024-12-22T14:38:01.812-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.59"
time=2024-12-22T14:38:02.063-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.63"
time=2024-12-22T14:38:02.343-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.67"
time=2024-12-22T14:38:02.623-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.71"
time=2024-12-22T14:38:02.885-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.75"
time=2024-12-22T14:38:03.165-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.79"
time=2024-12-22T14:38:03.446-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.83"
time=2024-12-22T14:38:03.726-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.87"
time=2024-12-22T14:38:04.005-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.92"
time=2024-12-22T14:38:04.271-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.95"
time=2024-12-22T14:38:04.550-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.97"
time=2024-12-22T14:38:04.832-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.98"
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 38912.00 MiB of pinned memory: out of memory
time=2024-12-22T14:38:05.110-05:00 level=DEBUG source=server.go:632 msg="model load progress 1.00"
time=2024-12-22T14:38:05.389-05:00 level=DEBUG source=server.go:635 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init: CPU KV buffer size = 38912.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 272.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 272.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 993
DEBUG [initialize] initializing slots | n_slots=1 tid="35996" timestamp=1734896295
DEBUG [initialize] new slot | n_ctx_slot=131072 slot_id=0 tid="35996" timestamp=1734896295
INFO [wmain] model loaded | tid="35996" timestamp=1734896295
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="35996" timestamp=1734896295
time=2024-12-22T14:38:15.801-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=0 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=1 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=3 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=4 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=5 tid="35996" timestamp=1734896297
time=2024-12-22T14:38:17.919-05:00 level=INFO source=server.go:626 msg="llama runner started in 34.10 seconds"
time=2024-12-22T14:38:17.919-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:38:17.919-05:00 level=DEBUG source=routes.go:1422 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=6 tid="35996" timestamp=1734896297
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=7 tid="35996" timestamp=1734896297
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=409 slot_id=0 task_id=7 tid="35996" timestamp=1734896297
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=7 tid="35996" timestamp=1734896297
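
A quick back-of-the-envelope check, using only dimensions this log itself reports (80 layers, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 cache entries), shows the KV-cache figures above are exactly what a 131072-token context implies:

```
K cache = 80 layers * 131072 tokens * 1024 * 2 bytes = 20480 MiB
V cache = same                                       = 20480 MiB
total   = 40960 MiB                                  -> "KV self size"
per layer: 40960 MiB / 80 = 512 MiB
  2 layers on each GPU -> 1024 MiB   (CUDA0/CUDA1 KV buffer size)
  76 layers on the CPU -> 38912 MiB  (CPU KV buffer size, and the failed
                                      38912 MiB pinned-memory allocation)
```

At this context length the cache alone rivals the 39.59 GiB of Q4_K_M weights, which is why the scheduler offloads only 4 of 81 layers and the run comes out correct but largely CPU-bound.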

<!-- gh-comment-id:2558573011 --> @robbyjo commented on GitHub (Dec 22, 2024): Thank you for the insight. I tried version 0.3.14 and set the context (num_ctx) to 131072, and it worked! Thank you so much! I tested the following model:

> ollama.exe run hf.co/ArliAI/Llama-3.1-70B-ArliAI-RPMax-v1.3-GGUF:Q4_K_M
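
For anyone reproducing this workaround, here is a minimal sketch of two ways to pin num_ctx on Windows (the model tag is the one quoted above; `rpmax-131k` and the `Modelfile` name are just placeholders):

```
REM Option 1: bake the context size into a derived model via a Modelfile
echo FROM hf.co/ArliAI/Llama-3.1-70B-ArliAI-RPMax-v1.3-GGUF:Q4_K_M> Modelfile
echo PARAMETER num_ctx 131072>> Modelfile
ollama.exe create rpmax-131k -f Modelfile
ollama.exe run rpmax-131k

REM Option 2: set it interactively, for the current session only
ollama.exe run hf.co/ArliAI/Llama-3.1-70B-ArliAI-RPMax-v1.3-GGUF:Q4_K_M
REM then, at the >>> prompt:
REM   /set parameter num_ctx 131072
```

The same setting can also be passed per request through the REST API ("options": {"num_ctx": 131072} in the body of POST /api/generate). Note that this run also had OLLAMA_GPU_OVERHEAD=1572864000 set, the roughly 1.5 GiB per-GPU reserve that appears as memory.gpu_overhead in the scheduler output below.
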
Server log:

> 2024/12/22 14:35:19 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1572864000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\\DeepLearning\\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-12-22T14:35:19.128-05:00 level=INFO source=images.go:754 msg="total blobs: 74" time=2024-12-22T14:35:19.130-05:00 level=INFO source=images.go:761 msg="total unused blobs removed: 0" time=2024-12-22T14:35:19.131-05:00 level=INFO source=routes.go:1205 msg="Listening on 127.0.0.1:11434 (version 0.3.14)" time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_v6.1\\ollama_llama_server.exe" time=2024-12-22T14:35:19.131-05:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v6.1]" time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=common.go:50 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY" time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler" time=2024-12-22T14:35:19.131-05:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs" time=2024-12-22T14:35:19.131-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1 time=2024-12-22T14:35:19.131-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1 time=2024-12-22T14:35:19.131-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32 time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=gpu.go:94 msg="searching for GPU discovery libraries for NVIDIA" time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=gpu.go:505 msg="Searching for GPU library" name=nvml.dll time=2024-12-22T14:35:19.131-05:00 level=DEBUG source=gpu.go:528 msg="gpu library search" globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvml.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.1.0\\nvml.dll C:\\Program Files (x86)\\Incredibuild\\nvml.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvml.dll C:\\Program Files\\nodejs\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files\\PuTTY\\nvml.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvml.dll C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler\\nvml.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\User\\.dotnet\\tools\\nvml.dll C:\\Users\\User\\miniconda3\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\usr\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Scripts\\nvml.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvml.dll C:\\Program Files\\7-Zip\\nvml.dll C:\\ffmpeg\\bin\\nvml.dll C:\\Windows\\System32\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\User\\.cache\\lm-studio\\bin\\nvml.dll c:\\Windows\\System32\\nvml.dll]" time=2024-12-22T14:35:19.132-05:00 level=DEBUG source=gpu.go:533 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll" time=2024-12-22T14:35:19.132-05:00 level=DEBUG source=gpu.go:562 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll C:\\Windows\\System32\\nvml.dll c:\\Windows\\System32\\nvml.dll]" time=2024-12-22T14:35:19.149-05:00 level=DEBUG source=gpu.go:115 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll time=2024-12-22T14:35:19.149-05:00 level=DEBUG source=gpu.go:505 msg="Searching for GPU library" name=nvcuda.dll time=2024-12-22T14:35:19.149-05:00 level=DEBUG source=gpu.go:528 msg="gpu library search" globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing
Toolkit\\CUDA\\v12.1\\libnvvp\\nvcuda.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.1.0\\nvcuda.dll C:\\Program Files (x86)\\Incredibuild\\nvcuda.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvcuda.dll C:\\Program Files\\nodejs\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files\\PuTTY\\nvcuda.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\User\\.dotnet\\tools\\nvcuda.dll C:\\Users\\User\\miniconda3\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\usr\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Scripts\\nvcuda.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvcuda.dll C:\\Program Files\\7-Zip\\nvcuda.dll C:\\ffmpeg\\bin\\nvcuda.dll C:\\Windows\\System32\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\User\\.cache\\lm-studio\\bin\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]" time=2024-12-22T14:35:19.149-05:00 level=DEBUG source=gpu.go:533 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll" time=2024-12-22T14:35:19.150-05:00 level=DEBUG source=gpu.go:562 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\System32\\nvcuda.dll]" time=2024-12-22T14:35:19.177-05:00 level=DEBUG source=gpu.go:129 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll time=2024-12-22T14:35:19.529-05:00 level=INFO source=gpu.go:326 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" time=2024-12-22T14:35:19.529-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found." 
time=2024-12-22T14:35:19.531-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" time=2024-12-22T14:35:19.531-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" [GIN] 2024/12/22 - 14:35:45 | 200 | 0s | 127.0.0.1 | GET "/api/version" [GIN] 2024/12/22 - 14:35:53 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/12/22 - 14:35:53 | 200 | 10.3943ms | 127.0.0.1 | POST "/api/show" time=2024-12-22T14:35:53.955-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="155.9 GiB" before.free_swap="231.3 GiB" now.total="191.7 GiB" now.free="155.6 GiB" now.free_swap="229.7 GiB" time=2024-12-22T14:35:53.973-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T14:35:53.989-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" time=2024-12-22T14:35:53.989-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0xd7cca0 gpu_count=2 time=2024-12-22T14:35:54.009-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:35:54.009-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T14:35:54.009-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T14:35:54.009-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T14:35:54.010-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T14:35:54.011-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T14:35:54.011-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T14:35:54.012-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="155.6 GiB" before.free_swap="229.7 GiB" now.total="191.7 GiB" now.free="155.6 GiB" now.free_swap="229.7 GiB" time=2024-12-22T14:35:54.035-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T14:35:54.050-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" 
time=2024-12-22T14:35:54.051-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="155.6 GiB" free_swap="229.7 GiB" time=2024-12-22T14:35:54.051-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]" time=2024-12-22T14:35:54.051-05:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=77 layers.split=38,39 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="43.7 GiB" memory.required.partial="41.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[20.4 GiB 21.0 GiB]" memory.weights.total="38.9 GiB" memory.weights.repeating="38.1 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_v6.1\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T14:35:54.052-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_v6.1\\ollama_llama_server.exe" time=2024-12-22T14:35:54.056-05:00 level=INFO source=server.go:388 msg="starting llama server" 
cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model E:\\DeepLearning\\LLM\\blobs\\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 2048 --batch-size 512 --embedding --n-gpu-layers 77 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 38,39 --port 58940" time=2024-12-22T14:35:54.057-05:00 level=DEBUG source=server.go:405 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_3=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.1.0\\;C:\\Program Files (x86)\\Incredibuild;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Program Files\\Git\\cmd;C:\\Program Files\\PuTTY\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler;C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\User\\.dotnet\\tools;C:\\Users\\User\\miniconda3;C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin;C:\\Users\\User\\miniconda3\\Library\\usr\\bin;C:\\Users\\User\\miniconda3\\Library\\bin;C:\\Users\\User\\miniconda3\\Scripts;C:\\Users\\User\\AppData\\Roaming\\npm;C:\\Program Files\\7-Zip;C:\\ffmpeg\\bin;;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Users\\User\\.cache\\lm-studio\\bin ]" time=2024-12-22T14:35:54.112-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2024-12-22T14:35:54.112-05:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding" time=2024-12-22T14:35:54.113-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" INFO [wmain] starting c++ runner | tid="31080" timestamp=1734896154 INFO [wmain] build info | build=3871 commit="63424972" tid="31080" timestamp=1734896154 INFO [wmain] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="31080" timestamp=1734896154 total_threads=32 INFO 
[wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="58940" tid="31080" timestamp=1734896154 llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3 llama_model_loader: - kv 3: general.version str = v1.3 llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax llama_model_loader: - kv 5: general.basename str = Llama-3.1 llama_model_loader: - kv 6: general.size_label str = 70B llama_model_loader: - kv 7: llama.block_count u32 = 80 llama_model_loader: - kv 8: llama.context_length u32 = 131072 llama_model_loader: - kv 9: llama.embedding_length u32 = 8192 llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 15: llama.attention.key_length u32 = 128 llama_model_loader: - kv 16: llama.attention.value_length u32 = 128 llama_model_loader: - kv 17: general.file_type u32 = 15 llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe time=2024-12-22T14:35:54.373-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 28: general.quantization_version u32 = 2 llama_model_loader: - type f32: 162 tensors llama_model_loader: - type q4_K: 441 tensors llama_model_loader: - type q5_K: 40 tensors llama_model_loader: - type q6_K: 81 tensors llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 8192 llm_load_print_meta: n_layer = 80 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 28672 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 70B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 70.55 B llm_load_print_meta: model size = 39.59 GiB (4.82 BPW) llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOM token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes llm_load_tensors: ggml ctx size = 1.02 MiB llm_load_tensors: offloading 77 repeating layers to GPU llm_load_tensors: offloaded 77/81 layers to GPU llm_load_tensors: CUDA_Host buffer size = 2942.24 MiB llm_load_tensors: CUDA0 buffer size = 18482.20 MiB llm_load_tensors: CUDA1 buffer size = 19118.70 MiB time=2024-12-22T14:35:55.939-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.01" time=2024-12-22T14:35:56.222-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.06" time=2024-12-22T14:35:56.500-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.09" time=2024-12-22T14:35:56.752-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.12" time=2024-12-22T14:35:57.004-05:00 level=DEBUG source=server.go:632 
msg="model load progress 0.15" time=2024-12-22T14:35:57.284-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.18" time=2024-12-22T14:35:57.534-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.21" time=2024-12-22T14:35:57.785-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.24" time=2024-12-22T14:35:58.064-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.27" time=2024-12-22T14:35:58.315-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.30" time=2024-12-22T14:35:58.594-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.33" time=2024-12-22T14:35:58.875-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.36" time=2024-12-22T14:35:59.155-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.40" time=2024-12-22T14:35:59.406-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.43" time=2024-12-22T14:35:59.657-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.46" time=2024-12-22T14:35:59.937-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.49" time=2024-12-22T14:36:00.187-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.52" time=2024-12-22T14:36:00.468-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.55" time=2024-12-22T14:36:00.750-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.58" time=2024-12-22T14:36:02.869-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.59" time=2024-12-22T14:36:04.223-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.60" time=2024-12-22T14:36:06.397-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.61" time=2024-12-22T14:36:07.724-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.62" time=2024-12-22T14:36:09.362-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.63" time=2024-12-22T14:36:11.245-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.64" time=2024-12-22T14:36:12.897-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.65" time=2024-12-22T14:36:14.538-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.66" time=2024-12-22T14:36:16.675-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.67" time=2024-12-22T14:36:18.000-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.68" time=2024-12-22T14:36:19.896-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.69" time=2024-12-22T14:36:21.517-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.70" time=2024-12-22T14:36:23.120-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.71" time=2024-12-22T14:36:25.035-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.72" time=2024-12-22T14:36:26.930-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.73" time=2024-12-22T14:36:28.286-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.74" time=2024-12-22T14:36:30.128-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.75" time=2024-12-22T14:36:32.294-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.76" time=2024-12-22T14:36:33.397-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.77" time=2024-12-22T14:36:35.506-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.78" time=2024-12-22T14:36:36.874-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.79" time=2024-12-22T14:36:39.084-05:00 level=DEBUG 
source=server.go:632 msg="model load progress 0.80" time=2024-12-22T14:36:40.746-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.81" time=2024-12-22T14:36:42.690-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.82" time=2024-12-22T14:36:44.041-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.83" time=2024-12-22T14:36:45.884-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.84" time=2024-12-22T14:36:47.826-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.85" time=2024-12-22T14:36:49.210-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.86" time=2024-12-22T14:36:51.371-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.87" time=2024-12-22T14:36:52.757-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.88" time=2024-12-22T14:36:54.420-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.89" time=2024-12-22T14:36:56.318-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.90" time=2024-12-22T14:36:57.955-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.91" time=2024-12-22T14:36:59.591-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.92" time=2024-12-22T14:37:01.213-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.93" time=2024-12-22T14:37:02.849-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.94" time=2024-12-22T14:37:04.486-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.95" time=2024-12-22T14:37:06.913-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.96" time=2024-12-22T14:37:08.267-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.97" time=2024-12-22T14:37:09.872-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.98" time=2024-12-22T14:37:11.804-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.99" time=2024-12-22T14:37:13.440-05:00 level=DEBUG source=server.go:632 msg="model load progress 1.00" time=2024-12-22T14:37:13.719-05:00 level=DEBUG source=server.go:635 msg="model load completed, waiting for server to become available" status="llm server loading model" llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 24.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 304.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 312.00 MiB llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB llama_new_context_with_model: CUDA1 compute buffer size = 162.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB llama_new_context_with_model: graph nodes = 2247 llama_new_context_with_model: graph splits = 44 DEBUG [initialize] initializing slots | n_slots=1 tid="31080" timestamp=1734896234 DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="31080" timestamp=1734896234 INFO [wmain] model loaded | tid="31080" timestamp=1734896234 DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="31080" timestamp=1734896234 DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 
task_id=0 tid="31080" timestamp=1734896234 time=2024-12-22T14:37:14.780-05:00 level=INFO source=server.go:626 msg="llama runner started in 80.67 seconds" time=2024-12-22T14:37:14.780-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 [GIN] 2024/12/22 - 14:37:14 | 200 | 1m20s | 127.0.0.1 | POST "/api/generate" time=2024-12-22T14:37:14.780-05:00 level=DEBUG source=sched.go:466 msg="context for request finished" time=2024-12-22T14:37:14.780-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 duration=1h0m0s time=2024-12-22T14:37:14.780-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0 time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0 time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:37:43.136-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="155.6 GiB" before.free_swap="229.7 GiB" now.total="191.7 GiB" now.free="152.7 GiB" now.free_swap="186.8 GiB" time=2024-12-22T14:37:43.158-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="2.3 GiB" now.used="21.7 GiB" time=2024-12-22T14:37:43.174-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="2.9 GiB" now.used="20.0 GiB" time=2024-12-22T14:37:43.175-05:00 level=DEBUG source=server.go:1086 msg="stopping llama server" time=2024-12-22T14:37:43.175-05:00 level=DEBUG source=server.go:1092 msg="waiting for llama server to exit" time=2024-12-22T14:37:43.439-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="152.7 GiB" before.free_swap="186.8 GiB" now.total="191.7 GiB" now.free="152.8 GiB" now.free_swap="226.0 GiB" time=2024-12-22T14:37:43.501-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" 
gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="2.3 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T14:37:43.517-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="2.9 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" time=2024-12-22T14:37:43.517-05:00 level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.38 seconds" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=server.go:1096 msg="llama server stopped" time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:37:43.714-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="152.8 GiB" before.free_swap="226.0 GiB" now.total="191.7 GiB" now.free="156.1 GiB" now.free_swap="229.7 GiB" time=2024-12-22T14:37:43.734-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T14:37:43.749-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB" time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.1 GiB" minimum_memory=479199232 
layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB" time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T14:37:43.771-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="67.1 GiB" time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.1 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="67.1 GiB" time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" time=2024-12-22T14:37:43.772-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T14:37:43.773-05:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.7 GiB" before.free="156.1 GiB" before.free_swap="229.7 GiB" now.total="191.7 GiB" now.free="156.1 GiB" now.free_swap="229.7 GiB" time=2024-12-22T14:37:43.795-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T14:37:43.811-05:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" time=2024-12-22T14:37:43.812-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="156.1 GiB" free_swap="229.7 GiB" time=2024-12-22T14:37:43.812-05:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]" time=2024-12-22T14:37:43.813-05:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=4 layers.split=2,2 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="117.9 GiB" memory.required.partial="40.6 GiB" memory.required.kv="40.0 GiB" memory.required.allocations="[20.3 GiB 20.3 GiB]" memory.weights.total="78.2 GiB" memory.weights.repeating="77.4 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="16.8 GiB" memory.graph.partial="16.8 GiB" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" 
file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_v6.1\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T14:37:43.813-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_v6.1\\ollama_llama_server.exe" time=2024-12-22T14:37:43.816-05:00 level=INFO source=server.go:388 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model E:\\DeepLearning\\LLM\\blobs\\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 131072 --batch-size 512 --embedding --n-gpu-layers 4 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 2,2 --port 59404" time=2024-12-22T14:37:43.816-05:00 level=DEBUG source=server.go:405 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_3=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 
PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.1.0\\;C:\\Program Files (x86)\\Incredibuild;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Program Files\\Git\\cmd;C:\\Program Files\\PuTTY\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler;C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\User\\.dotnet\\tools;C:\\Users\\User\\miniconda3;C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin;C:\\Users\\User\\miniconda3\\Library\\usr\\bin;C:\\Users\\User\\miniconda3\\Library\\bin;C:\\Users\\User\\miniconda3\\Scripts;C:\\Users\\User\\AppData\\Roaming\\npm;C:\\Program Files\\7-Zip;C:\\ffmpeg\\bin;;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Users\\User\\.cache\\lm-studio\\bin ]" time=2024-12-22T14:37:43.820-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2024-12-22T14:37:43.820-05:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding" time=2024-12-22T14:37:43.820-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" INFO [wmain] starting c++ runner | tid="35996" timestamp=1734896263 INFO [wmain] build info | build=3871 commit="63424972" tid="35996" timestamp=1734896263 INFO [wmain] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="35996" timestamp=1734896263 total_threads=32 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="59404" tid="35996" timestamp=1734896263 llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3
llama_model_loader: - kv 3: general.version str = v1.3
llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
time=2024-12-22T14:37:44.075-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 1.02 MiB
llm_load_tensors: offloading 4 repeating layers to GPU
llm_load_tensors: offloaded 4/81 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 38467.63 MiB
llm_load_tensors: CUDA0 buffer size = 1037.75 MiB
llm_load_tensors: CUDA1 buffer size = 1037.75 MiB
time=2024-12-22T14:37:57.983-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.04"
time=2024-12-22T14:37:58.263-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.08"
time=2024-12-22T14:37:58.541-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.12"
time=2024-12-22T14:37:58.807-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.16"
time=2024-12-22T14:37:59.086-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.20"
time=2024-12-22T14:37:59.365-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.24"
time=2024-12-22T14:37:59.645-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.28"
time=2024-12-22T14:37:59.926-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.32"
time=2024-12-22T14:38:00.192-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.36"
time=2024-12-22T14:38:00.472-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.40"
time=2024-12-22T14:38:00.752-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.44"
time=2024-12-22T14:38:01.032-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.48"
time=2024-12-22T14:38:01.283-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.52"
time=2024-12-22T14:38:01.533-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.55"
time=2024-12-22T14:38:01.812-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.59"
time=2024-12-22T14:38:02.063-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.63"
time=2024-12-22T14:38:02.343-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.67"
time=2024-12-22T14:38:02.623-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.71"
time=2024-12-22T14:38:02.885-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.75"
time=2024-12-22T14:38:03.165-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.79"
time=2024-12-22T14:38:03.446-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.83"
time=2024-12-22T14:38:03.726-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.87"
time=2024-12-22T14:38:04.005-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.92"
time=2024-12-22T14:38:04.271-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.95"
time=2024-12-22T14:38:04.550-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.97"
time=2024-12-22T14:38:04.832-05:00 level=DEBUG source=server.go:632 msg="model load progress 0.98"
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 38912.00 MiB of pinned memory: out of memory
time=2024-12-22T14:38:05.110-05:00 level=DEBUG source=server.go:632 msg="model load progress 1.00"
time=2024-12-22T14:38:05.389-05:00 level=DEBUG source=server.go:635 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init: CPU KV buffer size = 38912.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 272.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 272.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 993
DEBUG [initialize] initializing slots | n_slots=1 tid="35996" timestamp=1734896295
DEBUG [initialize] new slot | n_ctx_slot=131072 slot_id=0 tid="35996" timestamp=1734896295
INFO [wmain] model loaded | tid="35996" timestamp=1734896295
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="35996" timestamp=1734896295
time=2024-12-22T14:38:15.801-05:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=0 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=1 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=3 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=4 tid="35996" timestamp=1734896297
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=5 tid="35996" timestamp=1734896297
time=2024-12-22T14:38:17.919-05:00 level=INFO source=server.go:626 msg="llama runner started in 34.10 seconds"
time=2024-12-22T14:38:17.919-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T14:38:17.919-05:00 level=DEBUG source=routes.go:1422 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=6 tid="35996" timestamp=1734896297
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=7 tid="35996" timestamp=1734896297
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=409 slot_id=0 task_id=7 tid="35996" timestamp=1734896297
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=7 tid="35996" timestamp=1734896297

@rick-github commented on GitHub (Dec 22, 2024):

@robbyjo Great! Would it be possible for you to try different versions until you find the one where it fails? 0.4.0 is the likely culprit because of the switch to Go runners, but there has been other work between then and 0.5.4 that might also have caused the problem. If it can be nailed down to a specific version, there's a better chance of finding and fixing the root cause.

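A note for anyone following along with the bisection: the per-version check can be scripted against the local API so each result is comparable. The sketch below is hypothetical and not part of Ollama; the model tag, prompt, and crude "garbage ratio" heuristic are assumptions, while the `/api/generate` endpoint and the `num_gpu`/`num_ctx` options are the standard Ollama REST API.

```python
# bisect_check.py - hypothetical helper: run once per installed Ollama
# version and compare results. The garble heuristic is a rough assumption,
# not an official diagnostic.
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # default Ollama address

def generate(model: str, prompt: str, num_gpu: int, num_ctx: int) -> str:
    """Send one non-streaming request with explicit offload/context options."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu, "num_ctx": num_ctx},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["response"]

def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Corrupt output like '":[-":[-' is mostly punctuation and stray
    non-ASCII, so flag text whose non-alphanumeric ratio is unusually high."""
    if not text:
        return True
    junk = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return junk / len(text) > threshold

if __name__ == "__main__":
    # "llama3.1:70b" is a placeholder tag; substitute the actual model here.
    out = generate("llama3.1:70b", "Describe a simple roof repair.", 60, 32768)
    print("GARBLED" if looks_garbled(out) else "OK", "-", out[:120])
```

Running the same script with the same options after each install keeps the comparison between versions apples-to-apples.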

@robbyjo commented on GitHub (Dec 22, 2024):

@rick-github It last worked at 0.4.7 and failed at 0.5.0.

However, for the entire 0.4.x series, I saw GPU VRAM usage of only ~6GB on GPU 0 and ~5GB on GPU 1, instead of ~22GB each as on 0.3.14. It also feels slow compared to a single GPU on 0.5.4.

Edit: Clarification. For 0.4.x, GPU VRAM usage was ~22GB BEFORE I changed the num_ctx parameter and submitted my query.


@robbyjo commented on GitHub (Dec 22, 2024):

For version 0.4.7, I tried /set parameter num_gpu 48 (and 54, 60, 64); it failed with cudaMalloc failed: out of memory when num_ctx is 131072.

Strangely enough, on version 0.4.7, setting num_gpu to 54 and num_ctx to 32768 produces garbled output again, even though VRAM usage is up.

With num_gpu set to 40 and num_ctx to 32768, GPU usage rose to about ~12GB per card, but the output was still garbled.

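A quick sanity check on why the 131072-context runs hit cudaMalloc failed: out of memory: the KV cache grows linearly with num_ctx. This is a back-of-envelope sketch using the n_layer and n_embd_k_gqa values printed in the load logs in this thread; it assumes the default f16 K/V cache (no q8_0 quantization).

```python
# kv_cache_size.py - estimate the llama.cpp KV cache for this 70B model.
n_layer = 80        # llama.block_count from the model metadata
n_embd_gqa = 1024   # n_embd_k_gqa == n_embd_v_gqa in the load logs
bytes_per_elt = 2   # f16

def kv_cache_mib(n_ctx: int) -> float:
    # K and V each hold n_ctx * n_embd_gqa elements per layer.
    return 2 * n_layer * n_ctx * n_embd_gqa * bytes_per_elt / 2**20

print(kv_cache_mib(131072))  # 40960.0 MiB - matches "KV self size" in the log
print(kv_cache_mib(32768))   # 10240.0 MiB
print(kv_cache_mib(2048))    # 640.0 MiB
```

At the full 131072 context the cache alone is 40 GiB, on top of ~39.6 GiB of Q4_K_M weights, so the OOM at high num_gpu is expected; the garbled output at 32768, where everything fits, is the part that points to a real bug.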

@rick-github commented on GitHub (Dec 22, 2024):

Could you add the log for 0.4.7?


@robbyjo commented on GitHub (Dec 22, 2024):

Ok. Here is the log for 0.4.7. At first, I only changed num_ctx to 131072, which worked great except for low memory utilization. I interrupted the output. Then I changed num_ctx to 32768 and num_gpu to 48 and repeated the same query. The result was then garbled.

2024/12/22 18:11:20 routes.go:1197: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1572864000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\DeepLearning\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-12-22T18:11:20.021-05:00 level=INFO source=images.go:753 msg="total blobs: 74"
time=2024-12-22T18:11:20.023-05:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-12-22T18:11:20.024-05:00 level=INFO source=routes.go:1248 msg="Listening on 127.0.0.1:11434 (version 0.4.7)"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe"
time=2024-12-22T18:11:20.024-05:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v11 cuda_v12 rocm cpu cpu_avx cpu_avx2]"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:50 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-12-22T18:11:20.024-05:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-12-22T18:11:20.024-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-12-22T18:11:20.024-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2024-12-22T18:11:20.024-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=gpu.go:94 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=gpu.go:509 msg="Searching for GPU library" name=nvml.dll
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=gpu.go:532 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvml.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvml.dll C:\Windows\system32\nvml.dll C:\Windows\nvml.dll C:\Windows\System32\Wbem\nvml.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvml.dll C:\Windows\System32\OpenSSH\nvml.dll C:\Program Files\dotnet\nvml.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvml.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvml.dll C:\Program Files (x86)\Incredibuild\nvml.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvml.dll C:\Program Files\nodejs\nvml.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvml.dll C:\Program Files\Git\cmd\nvml.dll C:\Program Files\PuTTY\nvml.dll C:\Program Files\Docker\Docker\resources\bin\nvml.dll C:\Program Files\Go\bin\nvml.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2\nvml.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvml.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvml.dll C:\Users\User\.dotnet\tools\nvml.dll C:\Users\User\miniconda3\nvml.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvml.dll C:\Users\User\miniconda3\Library\usr\bin\nvml.dll C:\Users\User\miniconda3\Library\bin\nvml.dll C:\Users\User\miniconda3\Scripts\nvml.dll C:\Users\User\AppData\Roaming\npm\nvml.dll C:\Program Files\7-Zip\nvml.dll C:\ffmpeg\bin\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\.cache\lm-studio\bin\nvml.dll C:\Users\User\go\bin\nvml.dll C:\Users\User\.dotnet\tools\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=gpu.go:537 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll"
time=2024-12-22T18:11:20.025-05:00 level=DEBUG source=gpu.go:566 msg="discovered GPU libraries" paths="[C:\Windows\system32\nvml.dll c:\Windows\System32\nvml.dll]"
time=2024-12-22T18:11:20.040-05:00 level=DEBUG source=gpu.go:115 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2024-12-22T18:11:20.040-05:00 level=DEBUG source=gpu.go:509 msg="Searching for GPU library" name=nvcuda.dll
time=2024-12-22T18:11:20.040-05:00 level=DEBUG source=gpu.go:532 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvcuda.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvcuda.dll C:\Windows\system32\nvcuda.dll C:\Windows\nvcuda.dll C:\Windows\System32\Wbem\nvcuda.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvcuda.dll C:\Windows\System32\OpenSSH\nvcuda.dll C:\Program Files\dotnet\nvcuda.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvcuda.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvcuda.dll C:\Program Files (x86)\Incredibuild\nvcuda.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvcuda.dll C:\Program Files\nodejs\nvcuda.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvcuda.dll C:\Program Files\Git\cmd\nvcuda.dll C:\Program Files\PuTTY\nvcuda.dll C:\Program Files\Docker\Docker\resources\bin\nvcuda.dll C:\Program Files\Go\bin\nvcuda.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2\nvcuda.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvcuda.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvcuda.dll C:\Users\User\.dotnet\tools\nvcuda.dll C:\Users\User\miniconda3\nvcuda.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvcuda.dll C:\Users\User\miniconda3\Library\usr\bin\nvcuda.dll C:\Users\User\miniconda3\Library\bin\nvcuda.dll C:\Users\User\miniconda3\Scripts\nvcuda.dll C:\Users\User\AppData\Roaming\npm\nvcuda.dll C:\Program Files\7-Zip\nvcuda.dll C:\ffmpeg\bin\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\.cache\lm-studio\bin\nvcuda.dll C:\Users\User\go\bin\nvcuda.dll C:\Users\User\.dotnet\tools\nvcuda.dll c:\windows\system\nvcuda.dll]"
time=2024-12-22T18:11:20.040-05:00 level=DEBUG source=gpu.go:537 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll"
time=2024-12-22T18:11:20.041-05:00 level=DEBUG source=gpu.go:566 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFD00D34D20
dlsym: cuDriverGetVersion - 00007FFD00D34DC0
dlsym: cuDeviceGetCount - 00007FFD00D355B6
dlsym: cuDeviceGet - 00007FFD00D355B0
dlsym: cuDeviceGetAttribute - 00007FFD00D34F10
dlsym: cuDeviceGetUuid - 00007FFD00D355C2
dlsym: cuDeviceGetName - 00007FFD00D355BC
dlsym: cuCtxCreate_v3 - 00007FFD00D35634
dlsym: cuMemGetInfo_v2 - 00007FFD00D35736
dlsym: cuCtxDestroy - 00007FFD00D35646
calling cuInit
calling cuDriverGetVersion
raw version 0x2f26
CUDA driver version: 12.7
calling cuDeviceGetCount
device count 2
time=2024-12-22T18:11:20.065-05:00 level=DEBUG source=gpu.go:129 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA totalMem 24563 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA freeMem 22994 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] Compute Capability 8.9
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA totalMem 24563 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA freeMem 22994 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] Compute Capability 8.9
time=2024-12-22T18:11:20.279-05:00 level=INFO source=gpu.go:328 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2024-12-22T18:11:20.280-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2024-12-22T18:11:20.281-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2024-12-22T18:11:20.281-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/12/22 - 18:11:27 | 200 | 0s | 127.0.0.1 | GET "/api/version"
[GIN] 2024/12/22 - 18:11:32 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/22 - 18:11:32 | 200 | 17.5657ms | 127.0.0.1 | POST "/api/show"
time=2024-12-22T18:11:32.230-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="159.4 GiB" before.free_swap="199.4 GiB" now.total="191.7 GiB" now.free="161.3 GiB" now.free_swap="225.5 GiB"
time=2024-12-22T18:11:32.242-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:11:32.257-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:11:32.258-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7ff7752ccf20 gpu_count=2
time=2024-12-22T18:11:32.282-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:32.282-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T18:11:32.283-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T18:11:32.284-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T18:11:32.286-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T18:11:32.286-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T18:11:32.286-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T18:11:32.286-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.3 GiB" before.free_swap="225.5 GiB" now.total="191.7 GiB" now.free="161.3 GiB" now.free_swap="225.5 GiB"
time=2024-12-22T18:11:32.304-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:11:32.320-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:11:32.320-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="161.3 GiB" free_swap="225.5 GiB"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]"
time=2024-12-22T18:11:32.321-05:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=77 layers.split=38,39 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="43.7 GiB" memory.required.partial="41.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[20.4 GiB 21.0 GiB]" memory.weights.total="38.9 GiB" memory.weights.repeating="38.1 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T18:11:32.322-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T18:11:32.322-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T18:11:32.322-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe"
time=2024-12-22T18:11:32.326-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe --model E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 2048 --batch-size 512 --n-gpu-layers 77 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 38,39 --port 63835"
time=2024-12-22T18:11:32.326-05:00 level=DEBUG source=server.go:397 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_PATH_V12_6=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\Go\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2\;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin;C:\Users\User\go\bin;C:\Users\User\.dotnet\tools]"
time=2024-12-22T18:11:32.329-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-22T18:11:32.329-05:00 level=INFO source=server.go:559 msg="waiting for llama runner to start responding"
time=2024-12-22T18:11:32.330-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server error"
time=2024-12-22T18:11:32.406-05:00 level=INFO source=runner.go:939 msg="starting go runner"
time=2024-12-22T18:11:32.406-05:00 level=INFO source=runner.go:940 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(clang)" threads=8
time=2024-12-22T18:11:32.406-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63835"
llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3
llama_model_loader: - kv 3: general.version str = v1.3
llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
time=2024-12-22T18:11:32.585-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 1.02 MiB
llm_load_tensors: offloading 77 repeating layers to GPU
llm_load_tensors: offloaded 77/81 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 2942.24 MiB
llm_load_tensors: CUDA0 buffer size = 18482.20 MiB
llm_load_tensors: CUDA1 buffer size = 19118.70 MiB
time=2024-12-22T18:11:33.642-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.01"
time=2024-12-22T18:11:33.904-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.06"
time=2024-12-22T18:11:34.168-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.09"
time=2024-12-22T18:11:34.433-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.13"
time=2024-12-22T18:11:34.696-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.15"
time=2024-12-22T18:11:34.962-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.19"
time=2024-12-22T18:11:35.224-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.22"
time=2024-12-22T18:11:35.487-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.25"
time=2024-12-22T18:11:35.751-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.28"
time=2024-12-22T18:11:36.014-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.32"
time=2024-12-22T18:11:36.275-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.34"
time=2024-12-22T18:11:36.541-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.38"
time=2024-12-22T18:11:36.804-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.41"
time=2024-12-22T18:11:37.069-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.44"
time=2024-12-22T18:11:37.332-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.47"
time=2024-12-22T18:11:37.594-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.51"
time=2024-12-22T18:11:37.858-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.54"
time=2024-12-22T18:11:38.123-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.57"
time=2024-12-22T18:11:38.384-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.60"
time=2024-12-22T18:11:38.647-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.63"
time=2024-12-22T18:11:38.910-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.66"
time=2024-12-22T18:11:39.175-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.69"
time=2024-12-22T18:11:39.439-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.72"
time=2024-12-22T18:11:39.702-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.75"
time=2024-12-22T18:11:39.965-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.79"
time=2024-12-22T18:11:40.229-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.81"
time=2024-12-22T18:11:40.495-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.85"
time=2024-12-22T18:11:40.760-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.89"
time=2024-12-22T18:11:41.023-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.92"
time=2024-12-22T18:11:41.288-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.96"
time=2024-12-22T18:11:41.551-05:00 level=DEBUG source=server.go:604 msg="model load progress 1.00"
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 24.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 304.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 312.00 MiB
llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 162.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 44
time=2024-12-22T18:11:41.816-05:00 level=INFO source=server.go:598 msg="llama runner started in 9.49 seconds"
time=2024-12-22T18:11:41.816-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
[GIN] 2024/12/22 - 18:11:41 | 200 | 9.5959297s | 127.0.0.1 | POST "/api/generate"
time=2024-12-22T18:11:41.816-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-12-22T18:11:41.816-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 duration=1h0m0s
time=2024-12-22T18:11:41.816-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0
time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0
time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.3 GiB" before.free_swap="225.5 GiB" now.total="191.7 GiB" now.free="158.0 GiB" now.free_swap="183.0 GiB"
time=2024-12-22T18:11:58.838-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="2.3 GiB" now.used="21.6 GiB"
time=2024-12-22T18:11:58.854-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="2.9 GiB" now.used="20.0 GiB"
releasing nvml library
time=2024-12-22T18:11:58.855-05:00 level=DEBUG source=server.go:1075 msg="stopping llama server"
time=2024-12-22T18:11:58.856-05:00 level=DEBUG source=server.go:1081 msg="waiting for llama server to exit"
time=2024-12-22T18:11:59.119-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="158.0 GiB" before.free_swap="183.0 GiB" now.total="191.7 GiB" now.free="158.1 GiB" now.free_swap="221.8 GiB"
time=2024-12-22T18:11:59.181-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="2.3 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:11:59.198-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="2.9 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:11:59.199-05:00 level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.38 seconds" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=server.go:1085 msg="llama server stopped"
time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="158.1 GiB" before.free_swap="221.8 GiB" now.total="191.7 GiB" now.free="161.4 GiB" now.free_swap="225.8 GiB"
time=2024-12-22T18:11:59.399-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:11:59.414-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:11:59.437-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:11:59.437-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:173 msg="gpu has too little memory to allocate any layers" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB"
time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:317 msg="insufficient VRAM to load any model layers"
time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:173 msg="gpu has too little memory to allocate any layers" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.1 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB"
time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:317 msg="insufficient VRAM to load any model layers"
time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T18:11:59.439-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T18:11:59.441-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=memory.go:173 msg="gpu has too little memory to allocate any layers" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="67.1 GiB"
time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=memory.go:173 msg="gpu has too little memory to allocate any layers" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.1 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="67.1 GiB"
time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=memory.go:317 msg="insufficient VRAM to load any model layers"
time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.4 GiB" before.free_swap="225.8 GiB" now.total="191.7 GiB" now.free="161.4 GiB" now.free_swap="225.8 GiB"
time=2024-12-22T18:11:59.461-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:11:59.477-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:11:59.478-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="161.4 GiB" free_swap="225.8 GiB"
time=2024-12-22T18:11:59.478-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]"
time=2024-12-22T18:11:59.479-05:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=4 layers.split=2,2 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="117.9 GiB" memory.required.partial="40.6 GiB" memory.required.kv="40.0 GiB" memory.required.allocations="[20.3 GiB 20.3 GiB]" memory.weights.total="78.2 GiB" memory.weights.repeating="77.4 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="16.8 GiB" memory.graph.partial="16.8 GiB"
time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe"
time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe"
time=2024-12-22T18:11:59.484-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe --model E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 131072 --batch-size 512 --n-gpu-layers 4 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 2,2 --port 63918"
time=2024-12-22T18:11:59.484-05:00 level=DEBUG source=server.go:397 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_PATH_V12_6=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\Go\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2\;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin;C:\Users\User\go\bin;C:\Users\User\.dotnet\tools]"
time=2024-12-22T18:11:59.491-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-22T18:11:59.491-05:00 level=INFO source=server.go:559 msg="waiting for llama runner to start responding"
time=2024-12-22T18:11:59.495-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server error"
time=2024-12-22T18:11:59.558-05:00 level=INFO source=runner.go:939 msg="starting go runner"
time=2024-12-22T18:11:59.559-05:00 level=INFO source=runner.go:940 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(clang)" threads=8
time=2024-12-22T18:11:59.560-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63918"
llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3
llama_model_loader: - kv 3: general.version str = v1.3
llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
time=2024-12-22T18:11:59.756-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 1.02 MiB
llm_load_tensors: offloading 4 repeating layers to GPU
llm_load_tensors: offloaded 4/81 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 38467.63 MiB
llm_load_tensors: CUDA0 buffer size = 1037.75 MiB
llm_load_tensors: CUDA1 buffer size = 1037.75 MiB
time=2024-12-22T18:12:12.655-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.01"
time=2024-12-22T18:12:12.915-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.06"
time=2024-12-22T18:12:13.178-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.10"
time=2024-12-22T18:12:13.430-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.14"
time=2024-12-22T18:12:13.694-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.18"
time=2024-12-22T18:12:13.957-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.21"
time=2024-12-22T18:12:14.222-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.25"
time=2024-12-22T18:12:14.487-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.29"
time=2024-12-22T18:12:14.751-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.33"
time=2024-12-22T18:12:15.017-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.37"
time=2024-12-22T18:12:15.280-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.40"
time=2024-12-22T18:12:15.546-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.45"
time=2024-12-22T18:12:15.810-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.48"
time=2024-12-22T18:12:16.075-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.52"
time=2024-12-22T18:12:16.336-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.56"
time=2024-12-22T18:12:16.599-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.60"
time=2024-12-22T18:12:16.861-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.63"
time=2024-12-22T18:12:17.128-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.67"
time=2024-12-22T18:12:17.391-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.71"
time=2024-12-22T18:12:17.655-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.74"
time=2024-12-22T18:12:17.920-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.78"
time=2024-12-22T18:12:18.183-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.82"
time=2024-12-22T18:12:18.434-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.86"
time=2024-12-22T18:12:18.699-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.90"
time=2024-12-22T18:12:18.961-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.95"
time=2024-12-22T18:12:19.225-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.98"
time=2024-12-22T18:12:19.494-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.99"
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 38912.00 MiB of pinned memory: out of memory
time=2024-12-22T18:12:19.755-05:00 level=DEBUG source=server.go:604 msg="model load progress 1.00"
time=2024-12-22T18:12:20.019-05:00 level=DEBUG source=server.go:607 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init: CPU KV buffer size = 38912.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 272.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 272.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 993
time=2024-12-22T18:12:27.755-05:00 level=INFO source=server.go:598 msg="llama runner started in 28.26 seconds"
time=2024-12-22T18:12:27.755-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
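The numbers in this first load are self-consistent with the usual llama.cpp KV-cache arithmetic: at n_ctx = 131072 with 80 layers and f16 K/V, the cache totals 40960 MiB, and with only 4 layers offloaded (2 per GPU under --tensor-split 2,2) the remaining 76 layers leave a 38912 MiB KV buffer on the host. That is exactly the pinned allocation that fails in ggml_cuda_host_malloc above; the failure is non-fatal and llama.cpp falls back to ordinary pageable host memory, which is why the load still completes. A minimal sketch of the arithmetic (the kv_cache_mib helper exists purely to check the log, it is not anything in Ollama):

```python
# Back-of-the-envelope check of the KV-cache sizes reported in this load.
# Assumes the standard llama.cpp layout: per layer, K and V each hold
# n_ctx * n_embd_k_gqa elements (n_embd_k_gqa = 1024 per the metadata above),
# stored here as f16 (2 bytes per element).
MIB = 1024 * 1024

def kv_cache_mib(n_layers, n_ctx, n_embd_kv=1024, bytes_per_elt=2):
    per_layer = 2 * n_ctx * n_embd_kv * bytes_per_elt  # K + V for one layer
    return n_layers * per_layer / MIB

assert kv_cache_mib(80, 131072) == 40960.0  # "KV self size = 40960.00 MiB"
assert kv_cache_mib(76, 131072) == 38912.0  # CPU KV buffer = the failed pinned alloc
assert kv_cache_mib(2, 131072) == 1024.0    # CUDA0 / CUDA1 KV buffers
```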
time=2024-12-22T18:12:27.755-05:00 level=DEBUG source=routes.go:1466 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2024-12-22T18:12:27.757-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=409 used=0 remaining=409
time=2024-12-22T18:12:40.762-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-12-22T18:12:40.762-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 duration=1h0m0s
time=2024-12-22T18:12:40.762-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0
[GIN] 2024/12/22 - 18:12:40 | 200 | 41.9626905s | 127.0.0.1 | POST "/api/chat"
time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0
time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.4 GiB" before.free_swap="225.8 GiB" now.total="191.7 GiB" now.free="85.0 GiB" now.free_swap="145.6 GiB"
time=2024-12-22T18:13:06.118-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="18.6 GiB" now.used="5.4 GiB"
time=2024-12-22T18:13:06.133-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="19.7 GiB" now.used="3.2 GiB"
releasing nvml library
time=2024-12-22T18:13:06.134-05:00 level=DEBUG source=server.go:1075 msg="stopping llama server"
time=2024-12-22T18:13:06.134-05:00 level=DEBUG source=server.go:1081 msg="waiting for llama server to exit"
time=2024-12-22T18:13:06.398-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="85.0 GiB" before.free_swap="145.6 GiB" now.total="191.7 GiB" now.free="85.1 GiB" now.free_swap="148.9 GiB"
time=2024-12-22T18:13:06.707-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="18.6 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:06.723-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="19.7 GiB" now.total="24.0 GiB" now.free="22.4 GiB" now.used="563.6 MiB"
releasing nvml library
time=2024-12-22T18:13:06.724-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="85.1 GiB" before.free_swap="148.9 GiB" now.total="191.7 GiB" now.free="85.1 GiB" now.free_swap="151.7 GiB"
time=2024-12-22T18:13:07.439-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:07.454-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.4 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:07.454-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="85.1 GiB" before.free_swap="151.7 GiB" now.total="191.7 GiB" now.free="86.0 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:07.469-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:07.485-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:07.640-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="86.0 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="88.7 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:07.655-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:07.671-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:07.892-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="88.7 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="92.4 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:07.907-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:07.922-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:08.140-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="92.4 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="95.7 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:08.156-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:08.171-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:08.388-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="95.7 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="99.0 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:08.403-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:08.419-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:08.638-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="99.0 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="102.3 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:08.653-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:08.668-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:08.887-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="102.3 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="105.5 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:08.902-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:08.918-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:09.136-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="105.5 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="108.8 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:09.151-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:09.167-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:09.385-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="108.8 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="112.7 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:09.401-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:09.418-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:09.650-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="112.7 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="116.8 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:09.666-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:09.682-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:09.886-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="116.8 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="120.0 GiB" now.free_swap="152.4 GiB"
time=2024-12-22T18:13:09.900-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:09.916-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:10.135-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="120.0 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="123.6 GiB" now.free_swap="190.6 GiB"
time=2024-12-22T18:13:10.151-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:10.166-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:10.399-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="123.6 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="129.0 GiB" now.free_swap="190.6 GiB"
time=2024-12-22T18:13:10.415-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:10.431-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:10.650-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="129.0 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="133.7 GiB" now.free_swap="190.6 GiB"
time=2024-12-22T18:13:10.666-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:10.681-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:10.899-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="133.7 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="137.9 GiB" now.free_swap="190.6 GiB"
time=2024-12-22T18:13:10.914-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:10.929-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:11.135-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0418316 model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:11.135-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="137.9 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="142.5 GiB" now.free_swap="190.6 GiB"
time=2024-12-22T18:13:11.168-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:11.181-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:11.399-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3064613 model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=server.go:1085 msg="llama server stopped"
time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="142.5 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="161.9 GiB" now.free_swap="228.9 GiB"
time=2024-12-22T18:13:12.259-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:12.274-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:12.274-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=6.1810826 model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:12.275-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.9 GiB" before.free_swap="228.9 GiB" now.total="191.7 GiB" now.free="161.9 GiB" now.free_swap="228.9 GiB"
time=2024-12-22T18:13:12.289-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:12.304-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
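The repeating "updating cuda memory data … releasing nvml library" block above is the scheduler polling NVML until the stopped runner's VRAM is actually handed back by the driver; on Windows that can take a few seconds, which is why the "gpu VRAM usage didn't recover within timeout" warnings fire even though the unload completes shortly afterwards. Roughly what such a poll looks like from Python (using the pynvml bindings; the loop shape and threshold are illustrative, not Ollama's actual Go code in gpu.go):

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

deadline = time.monotonic() + 5.0  # recovery timeout, mirroring the ~5 s in the log
while time.monotonic() < deadline:
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"free={mem.free / 2**30:.1f} GiB  used={mem.used / 2**30:.1f} GiB")
    if mem.used < (1 << 30):  # arbitrary "recovered" threshold for this sketch
        break
    time.sleep(0.25)

pynvml.nvmlShutdown()
```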
time=2024-12-22T18:13:12.345-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
time=2024-12-22T18:13:12.345-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T18:13:12.345-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T18:13:12.346-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-12-22T18:13:12.346-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]"
time=2024-12-22T18:13:12.348-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T18:13:12.349-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T18:13:12.349-05:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 library=cuda parallel=1 required="39.5 GiB"
time=2024-12-22T18:13:12.349-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.9 GiB" before.free_swap="228.9 GiB" now.total="191.7 GiB" now.free="161.9 GiB" now.free_swap="228.9 GiB"
time=2024-12-22T18:13:12.458-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB"
time=2024-12-22T18:13:12.473-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB"
releasing nvml library
time=2024-12-22T18:13:12.474-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="161.9 GiB" free_swap="228.9 GiB"
time=2024-12-22T18:13:12.474-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]"
time=2024-12-22T18:13:12.475-05:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=48 layers.model=81 layers.offload=48 layers.split=24,24 memory.available="[22.5 GiB 22.1 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="59.7 GiB" memory.required.partial="39.5 GiB" memory.required.kv="10.0 GiB" memory.required.allocations="[19.7 GiB 19.7 GiB]" memory.weights.total="48.2 GiB" memory.weights.repeating="47.4 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="4.3 GiB" memory.graph.partial="4.3 GiB"
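Note how this second attempt differs from the first: with both GPUs now reported nearly empty, the scheduler asks for 48 layers split 24,24 and budgets 19.7 GiB per card against the ~22 GiB free. One simple strategy that reproduces the even 24,24 split (repeatedly hand the next layer to the GPU with the most remaining budget) is sketched below; this is an illustration of the idea behind the memory.go estimate, not Ollama's actual algorithm, and the ~0.59 GiB/layer figure is just memory.weights.repeating (47.4 GiB) divided by 80 layers:

```python
# Greedy layer placement: each layer to offload goes to whichever GPU
# currently has the most unclaimed VRAM. Illustrative only.
def split_layers(n_offload, free_gib, layer_gib):
    free = list(free_gib)
    counts = [0] * len(free)
    for _ in range(n_offload):
        i = max(range(len(free)), key=lambda j: free[j])
        counts[i] += 1
        free[i] -= layer_gib
    return counts

# 48 layers over [22.5, 22.1] GiB at ~0.59 GiB/layer alternates between
# the two cards and lands on the 24,24 split seen in --tensor-split.
print(split_layers(48, [22.5, 22.1], layer_gib=47.4 / 80))  # [24, 24]
```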
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe"
time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe"
time=2024-12-22T18:13:12.481-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe --model E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 32768 --batch-size 512 --n-gpu-layers 48 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 24,24 --port 64176"
time=2024-12-22T18:13:12.481-05:00 level=DEBUG source=server.go:397 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_PATH_V12_6=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_VISIBLE_DEVICES=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8,GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\Go\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2\;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin;C:\Users\User\go\bin;C:\Users\User\.dotnet\tools]"
time=2024-12-22T18:13:12.489-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-22T18:13:12.489-05:00 level=INFO source=server.go:559 msg="waiting for llama runner to start responding"
time=2024-12-22T18:13:12.492-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server error"
time=2024-12-22T18:13:12.585-05:00 level=INFO source=runner.go:939 msg="starting go runner"
time=2024-12-22T18:13:12.585-05:00 level=INFO source=runner.go:940 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(clang)" threads=8
time=2024-12-22T18:13:12.586-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:64176"
llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3
llama_model_loader: - kv 3: general.version str = v1.3
llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
time=2024-12-22T18:13:12.757-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 1.02 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloaded 48/81 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 17107.43 MiB
llm_load_tensors: CUDA0 buffer size = 11512.01 MiB
llm_load_tensors: CUDA1 buffer size = 11923.69 MiB
time=2024-12-22T18:13:17.388-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.04"
time=2024-12-22T18:13:17.651-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.06"
time=2024-12-22T18:13:17.902-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.09"
time=2024-12-22T18:13:18.164-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.13"
time=2024-12-22T18:13:18.414-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.16"
time=2024-12-22T18:13:18.680-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.20"
time=2024-12-22T18:13:18.943-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.24"
time=2024-12-22T18:13:19.208-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.28"
time=2024-12-22T18:13:19.471-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.31"
time=2024-12-22T18:13:19.734-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.35"
time=2024-12-22T18:13:19.999-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.39"
time=2024-12-22T18:13:20.260-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.42"
time=2024-12-22T18:13:20.525-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.45"
time=2024-12-22T18:13:20.791-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.48"
time=2024-12-22T18:13:21.054-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.51"
time=2024-12-22T18:13:21.318-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.54"
time=2024-12-22T18:13:21.582-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.57"
time=2024-12-22T18:13:21.832-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.60"
time=2024-12-22T18:13:22.095-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.64"
time=2024-12-22T18:13:22.359-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.67"
time=2024-12-22T18:13:22.624-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.70"
time=2024-12-22T18:13:22.888-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.72"
time=2024-12-22T18:13:23.153-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.75"
time=2024-12-22T18:13:23.417-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.78"
time=2024-12-22T18:13:23.680-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.81"
time=2024-12-22T18:13:23.942-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.84"
time=2024-12-22T18:13:24.203-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.88"
time=2024-12-22T18:13:24.468-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.92"
time=2024-12-22T18:13:24.732-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.95"
time=2024-12-22T18:13:24.994-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.99"
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-12-22T18:13:25.258-05:00 level=DEBUG source=server.go:604 msg="model load progress 1.00"
time=2024-12-22T18:13:25.509-05:00 level=DEBUG source=server.go:607 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init: CUDA_Host KV buffer size = 4096.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 3072.00 MiB
llama_new_context_with_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 176.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 80.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 421
time=2024-12-22T18:13:26.563-05:00 level=INFO source=server.go:598 msg="llama runner started in 14.07 seconds"
time=2024-12-22T18:13:26.563-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42
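Same arithmetic as in the first load, now with 48/80 layers on GPU and n_ctx = 32768: the KV cache totals 10240 MiB, split as 4096 MiB on the host (32 CPU layers) and 3072 MiB per card (24 layers each), matching the llama_kv_cache_init lines above. The graph split count also drops from 993 to 421, presumably because far fewer layers round-trip between the CPU and GPU backends. Reusing the kv_cache_mib sketch from the first load:

```python
assert kv_cache_mib(80, 32768) == 10240.0  # "KV self size = 10240.00 MiB"
assert kv_cache_mib(32, 32768) == 4096.0   # CUDA_Host KV buffer (32 CPU layers)
assert kv_cache_mib(24, 32768) == 3072.0   # CUDA0 / CUDA1 (24 layers each)
```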
time=2024-12-22T18:13:26.564-05:00 level=DEBUG source=server.go:962 msg="new runner detected, loading model for cgo tokenization"
llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3
llama_model_loader: - kv 3: general.version str = v1.3
llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2024-12-22T18:13:26.937-05:00 level=DEBUG source=routes.go:1466 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nSarah explained that there had been a<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. 
He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2024-12-22T18:13:26.938-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=825 used=0 remaining=825
time=2024-12-22T18:13:37.417-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-12-22T18:13:37.417-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 duration=1h0m0s
time=2024-12-22T18:13:37.417-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0
[GIN] 2024/12/22 - 18:13:37 | 200 | 31.3430724s | 127.0.0.1 | POST "/api/chat"
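For readers following the reproduction steps in the comment below: `num_ctx` and `num_gpu` are ordinary per-request options in Ollama's REST API (not environment variables), so they can be changed between runs without restarting the server. Below is a minimal Go sketch of issuing a chat request with the second run's settings; the model tag and prompt are illustrative placeholders, not the ones used in this issue.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Per-request runtime options: num_gpu is the number of layers to
	// offload to the GPU(s); num_ctx is the context window size.
	// These match the second run described in the comment below.
	payload, err := json.Marshal(map[string]any{
		"model": "llama3.1:70b", // illustrative tag; substitute the model under test
		"messages": []map[string]string{
			{"role": "user", "content": "Write one sentence."}, // placeholder prompt
		},
		"stream": false,
		"options": map[string]int{
			"num_ctx": 32768,
			"num_gpu": 48,
		},
	})
	if err != nil {
		panic(err)
	}

	// Default Ollama endpoint, matching OLLAMA_HOST in the logs above.
	resp, err := http.Post("http://127.0.0.1:11434/api/chat", "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Message.Content)
}
```

Inside an interactive `ollama run` session, `/set parameter num_ctx 32768` and `/set parameter num_gpu 48` apply the same options for that session.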

<!-- gh-comment-id:2558641618 --> @robbyjo commented on GitHub (Dec 22, 2024):

Ok. Here is the log for 0.4.7. At first, I only changed num_ctx to 131072, which worked great except for low memory utilization. I interrupted the output. Then I changed num_ctx to 32768 and num_gpu to 48 and repeated the same query. The result was then garbled.

2024/12/22 18:11:20 routes.go:1197: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1572864000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\\DeepLearning\\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-12-22T18:11:20.021-05:00 level=INFO source=images.go:753 msg="total blobs: 74" time=2024-12-22T18:11:20.023-05:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" time=2024-12-22T18:11:20.024-05:00 level=INFO source=routes.go:1248 msg="Listening on 127.0.0.1:11434 (version 0.4.7)" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm\\ollama_llama_server.exe" time=2024-12-22T18:11:20.024-05:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v11 cuda_v12 rocm cpu cpu_avx cpu_avx2]" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=common.go:50 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler" time=2024-12-22T18:11:20.024-05:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs" time=2024-12-22T18:11:20.024-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1 time=2024-12-22T18:11:20.024-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1 
time=2024-12-22T18:11:20.024-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32 time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=gpu.go:94 msg="searching for GPU discovery libraries for NVIDIA" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=gpu.go:509 msg="Searching for GPU library" name=nvml.dll time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=gpu.go:532 msg="gpu library search" globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvml.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvml.dll C:\\Program Files (x86)\\Incredibuild\\nvml.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvml.dll C:\\Program Files\\nodejs\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files\\PuTTY\\nvml.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvml.dll C:\\Program Files\\Go\\bin\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2024.3.2\\nvml.dll C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler\\nvml.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\User\\.dotnet\\tools\\nvml.dll C:\\Users\\User\\miniconda3\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\usr\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Scripts\\nvml.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvml.dll C:\\Program Files\\7-Zip\\nvml.dll C:\\ffmpeg\\bin\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\User\\.cache\\lm-studio\\bin\\nvml.dll C:\\Users\\User\\go\\bin\\nvml.dll C:\\Users\\User\\.dotnet\\tools\\nvml.dll c:\\Windows\\System32\\nvml.dll]" time=2024-12-22T18:11:20.024-05:00 level=DEBUG source=gpu.go:537 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll" time=2024-12-22T18:11:20.025-05:00 level=DEBUG source=gpu.go:566 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]" time=2024-12-22T18:11:20.040-05:00 level=DEBUG source=gpu.go:115 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll time=2024-12-22T18:11:20.040-05:00 level=DEBUG source=gpu.go:509 msg="Searching for GPU library" name=nvcuda.dll time=2024-12-22T18:11:20.040-05:00 level=DEBUG source=gpu.go:532 msg="gpu library search" 
globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvcuda.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvcuda.dll C:\\Program Files (x86)\\Incredibuild\\nvcuda.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvcuda.dll C:\\Program Files\\nodejs\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files\\PuTTY\\nvcuda.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvcuda.dll C:\\Program Files\\Go\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2024.3.2\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\User\\.dotnet\\tools\\nvcuda.dll C:\\Users\\User\\miniconda3\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\usr\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Scripts\\nvcuda.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvcuda.dll C:\\Program Files\\7-Zip\\nvcuda.dll C:\\ffmpeg\\bin\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\User\\.cache\\lm-studio\\bin\\nvcuda.dll C:\\Users\\User\\go\\bin\\nvcuda.dll C:\\Users\\User\\.dotnet\\tools\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]" time=2024-12-22T18:11:20.040-05:00 level=DEBUG source=gpu.go:537 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll" time=2024-12-22T18:11:20.041-05:00 level=DEBUG source=gpu.go:566 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll] initializing C:\Windows\system32\nvcuda.dll dlsym: cuInit - 00007FFD00D34D20 dlsym: cuDriverGetVersion - 00007FFD00D34DC0 dlsym: cuDeviceGetCount - 00007FFD00D355B6 dlsym: cuDeviceGet - 00007FFD00D355B0 dlsym: cuDeviceGetAttribute - 00007FFD00D34F10 dlsym: cuDeviceGetUuid - 00007FFD00D355C2 dlsym: cuDeviceGetName - 00007FFD00D355BC dlsym: cuCtxCreate_v3 - 00007FFD00D35634 dlsym: cuMemGetInfo_v2 - 00007FFD00D35736 dlsym: cuCtxDestroy - 00007FFD00D35646 calling cuInit calling cuDriverGetVersion raw version 0x2f26 CUDA driver version: 12.7 calling cuDeviceGetCount device count 2 time=2024-12-22T18:11:20.065-05:00 level=DEBUG source=gpu.go:129 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll 
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA totalMem 24563 mb [GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA freeMem 22994 mb [GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] Compute Capability 8.9 [GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA totalMem 24563 mb [GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA freeMem 22994 mb [GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] Compute Capability 8.9 time=2024-12-22T18:11:20.279-05:00 level=INFO source=gpu.go:328 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" time=2024-12-22T18:11:20.280-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found." releasing cuda driver library releasing nvml library time=2024-12-22T18:11:20.281-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" time=2024-12-22T18:11:20.281-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" [GIN] 2024/12/22 - 18:11:27 | 200 | 0s | 127.0.0.1 | GET "/api/version" [GIN] 2024/12/22 - 18:11:32 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/12/22 - 18:11:32 | 200 | 17.5657ms | 127.0.0.1 | POST "/api/show" time=2024-12-22T18:11:32.230-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="159.4 GiB" before.free_swap="199.4 GiB" now.total="191.7 GiB" now.free="161.3 GiB" now.free_swap="225.5 GiB" time=2024-12-22T18:11:32.242-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:11:32.257-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:11:32.258-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7ff7752ccf20 gpu_count=2 time=2024-12-22T18:11:32.282-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:32.282-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T18:11:32.283-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T18:11:32.284-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T18:11:32.286-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T18:11:32.286-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T18:11:32.286-05:00 level=DEBUG source=memory.go:107 
msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T18:11:32.286-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.3 GiB" before.free_swap="225.5 GiB" now.total="191.7 GiB" now.free="161.3 GiB" now.free_swap="225.5 GiB" time=2024-12-22T18:11:32.304-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:11:32.320-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:11:32.320-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="161.3 GiB" free_swap="225.5 GiB" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]" time=2024-12-22T18:11:32.321-05:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=77 layers.split=38,39 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="43.7 GiB" memory.required.partial="41.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[20.4 GiB 21.0 GiB]" memory.weights.total="38.9 GiB" memory.weights.repeating="38.1 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm\\ollama_llama_server.exe" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T18:11:32.321-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" 
file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T18:11:32.322-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T18:11:32.322-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T18:11:32.322-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm\\ollama_llama_server.exe" time=2024-12-22T18:11:32.326-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model E:\\DeepLearning\\LLM\\blobs\\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 2048 --batch-size 512 --n-gpu-layers 77 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 38,39 --port 63835" time=2024-12-22T18:11:32.326-05:00 level=DEBUG source=server.go:397 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_3=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3 CUDA_PATH_V12_6=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files (x86)\\Incredibuild;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Program Files\\Git\\cmd;C:\\Program Files\\PuTTY\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2024.3.2\\;C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler;C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\User\\.dotnet\\tools;C:\\Users\\User\\miniconda3;C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin;C:\\Users\\User\\miniconda3\\Library\\usr\\bin;C:\\Users\\User\\miniconda3\\Library\\bin;C:\\Users\\User\\miniconda3\\Scripts;C:\\Users\\User\\AppData\\Roaming\\npm;C:\\Program 
Files\\7-Zip;C:\\ffmpeg\\bin;;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Users\\User\\.cache\\lm-studio\\bin;C:\\Users\\User\\go\\bin;C:\\Users\\User\\.dotnet\\tools]" time=2024-12-22T18:11:32.329-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2024-12-22T18:11:32.329-05:00 level=INFO source=server.go:559 msg="waiting for llama runner to start responding" time=2024-12-22T18:11:32.330-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server error" time=2024-12-22T18:11:32.406-05:00 level=INFO source=runner.go:939 msg="starting go runner" time=2024-12-22T18:11:32.406-05:00 level=INFO source=runner.go:940 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(clang)" threads=8 time=2024-12-22T18:11:32.406-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63835" llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3 llama_model_loader: - kv 3: general.version str = v1.3 llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax llama_model_loader: - kv 5: general.basename str = Llama-3.1 llama_model_loader: - kv 6: general.size_label str = 70B llama_model_loader: - kv 7: llama.block_count u32 = 80 llama_model_loader: - kv 8: llama.context_length u32 = 131072 llama_model_loader: - kv 9: llama.embedding_length u32 = 8192 llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 15: llama.attention.key_length u32 = 128 llama_model_loader: - kv 16: llama.attention.value_length u32 = 128 llama_model_loader: - kv 17: general.file_type u32 = 15 llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 28: general.quantization_version u32 = 2 llama_model_loader: - type f32: 162 tensors llama_model_loader: - type q4_K: 441 tensors llama_model_loader: - type q5_K: 40 tensors llama_model_loader: - type q6_K: 81 tensors time=2024-12-22T18:11:32.585-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 8192 llm_load_print_meta: n_layer = 80 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 28672 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 70B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 70.55 B llm_load_print_meta: model size = 39.59 GiB (4.82 BPW) llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOM token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes llm_load_tensors: ggml ctx size = 1.02 MiB llm_load_tensors: offloading 77 repeating layers to GPU llm_load_tensors: offloaded 77/81 layers to GPU llm_load_tensors: CUDA_Host buffer size = 2942.24 MiB llm_load_tensors: CUDA0 buffer size = 18482.20 MiB llm_load_tensors: CUDA1 buffer size = 19118.70 MiB time=2024-12-22T18:11:33.642-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.01" time=2024-12-22T18:11:33.904-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.06" time=2024-12-22T18:11:34.168-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.09" 
time=2024-12-22T18:11:34.433-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.13" time=2024-12-22T18:11:34.696-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.15" time=2024-12-22T18:11:34.962-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.19" time=2024-12-22T18:11:35.224-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.22" time=2024-12-22T18:11:35.487-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.25" time=2024-12-22T18:11:35.751-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.28" time=2024-12-22T18:11:36.014-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.32" time=2024-12-22T18:11:36.275-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.34" time=2024-12-22T18:11:36.541-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.38" time=2024-12-22T18:11:36.804-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.41" time=2024-12-22T18:11:37.069-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.44" time=2024-12-22T18:11:37.332-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.47" time=2024-12-22T18:11:37.594-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.51" time=2024-12-22T18:11:37.858-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.54" time=2024-12-22T18:11:38.123-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.57" time=2024-12-22T18:11:38.384-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.60" time=2024-12-22T18:11:38.647-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.63" time=2024-12-22T18:11:38.910-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.66" time=2024-12-22T18:11:39.175-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.69" time=2024-12-22T18:11:39.439-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.72" time=2024-12-22T18:11:39.702-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.75" time=2024-12-22T18:11:39.965-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.79" time=2024-12-22T18:11:40.229-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.81" time=2024-12-22T18:11:40.495-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.85" time=2024-12-22T18:11:40.760-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.89" time=2024-12-22T18:11:41.023-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.92" time=2024-12-22T18:11:41.288-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.96" time=2024-12-22T18:11:41.551-05:00 level=DEBUG source=server.go:604 msg="model load progress 1.00" llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 24.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 304.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 312.00 MiB llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB llama_new_context_with_model: CUDA1 compute buffer size = 162.00 MiB llama_new_context_with_model: 
CUDA_Host compute buffer size = 20.01 MiB llama_new_context_with_model: graph nodes = 2247 llama_new_context_with_model: graph splits = 44 time=2024-12-22T18:11:41.816-05:00 level=INFO source=server.go:598 msg="llama runner started in 9.49 seconds" time=2024-12-22T18:11:41.816-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 [GIN] 2024/12/22 - 18:11:41 | 200 | 9.5959297s | 127.0.0.1 | POST "/api/generate" time=2024-12-22T18:11:41.816-05:00 level=DEBUG source=sched.go:466 msg="context for request finished" time=2024-12-22T18:11:41.816-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 duration=1h0m0s time=2024-12-22T18:11:41.816-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0 time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0 time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:58.819-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.3 GiB" before.free_swap="225.5 GiB" now.total="191.7 GiB" now.free="158.0 GiB" now.free_swap="183.0 GiB" time=2024-12-22T18:11:58.838-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="2.3 GiB" now.used="21.6 GiB" time=2024-12-22T18:11:58.854-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="2.9 GiB" now.used="20.0 GiB" releasing nvml library time=2024-12-22T18:11:58.855-05:00 level=DEBUG source=server.go:1075 msg="stopping llama server" time=2024-12-22T18:11:58.856-05:00 level=DEBUG source=server.go:1081 msg="waiting for llama server to exit" time=2024-12-22T18:11:59.119-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="158.0 GiB" before.free_swap="183.0 GiB" now.total="191.7 GiB" now.free="158.1 GiB" now.free_swap="221.8 GiB" 
time=2024-12-22T18:11:59.181-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="2.3 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:11:59.198-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="2.9 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:11:59.199-05:00 level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.38 seconds" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=server.go:1085 msg="llama server stopped" time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:59.387-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="158.1 GiB" before.free_swap="221.8 GiB" now.total="191.7 GiB" now.free="161.4 GiB" now.free_swap="225.8 GiB" time=2024-12-22T18:11:59.399-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:11:59.414-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:11:59.437-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:11:59.437-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:173 msg="gpu has too little memory to allocate any layers" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB" time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:317 msg="insufficient VRAM to load any model layers" time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:173 msg="gpu has too little memory to allocate any layers" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 
library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.1 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB" time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:317 msg="insufficient VRAM to load any model layers" time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T18:11:59.439-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T18:11:59.441-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=memory.go:173 msg="gpu has too little memory to allocate any layers" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="67.1 GiB" time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=memory.go:173 msg="gpu has too little memory to allocate any layers" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.1 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="67.1 GiB" time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=memory.go:317 msg="insufficient VRAM to load any model layers" time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T18:11:59.442-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.4 GiB" before.free_swap="225.8 GiB" now.total="191.7 GiB" now.free="161.4 GiB" now.free_swap="225.8 GiB" time=2024-12-22T18:11:59.461-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:11:59.477-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:11:59.478-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="161.4 GiB" free_swap="225.8 GiB" time=2024-12-22T18:11:59.478-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.1 GiB 22.5 GiB]" time=2024-12-22T18:11:59.479-05:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=4 layers.split=2,2 memory.available="[22.1 GiB 22.5 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="117.9 GiB" memory.required.partial="40.6 GiB" memory.required.kv="40.0 GiB" memory.required.allocations="[20.3 GiB 20.3 GiB]" memory.weights.total="78.2 GiB" memory.weights.repeating="77.4 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="16.8 GiB" memory.graph.partial="16.8 GiB" time=2024-12-22T18:11:59.479-05:00 
level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T18:11:59.479-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm\\ollama_llama_server.exe" time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T18:11:59.480-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm\\ollama_llama_server.exe" time=2024-12-22T18:11:59.484-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model E:\\DeepLearning\\LLM\\blobs\\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 131072 --batch-size 512 --n-gpu-layers 4 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 2,2 --port 63918" time=2024-12-22T18:11:59.484-05:00 level=DEBUG source=server.go:397 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_3=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3 CUDA_PATH_V12_6=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 
PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files (x86)\\Incredibuild;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Program Files\\Git\\cmd;C:\\Program Files\\PuTTY\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2024.3.2\\;C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler;C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\User\\.dotnet\\tools;C:\\Users\\User\\miniconda3;C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin;C:\\Users\\User\\miniconda3\\Library\\usr\\bin;C:\\Users\\User\\miniconda3\\Library\\bin;C:\\Users\\User\\miniconda3\\Scripts;C:\\Users\\User\\AppData\\Roaming\\npm;C:\\Program Files\\7-Zip;C:\\ffmpeg\\bin;;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Users\\User\\.cache\\lm-studio\\bin;C:\\Users\\User\\go\\bin;C:\\Users\\User\\.dotnet\\tools]" time=2024-12-22T18:11:59.491-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2024-12-22T18:11:59.491-05:00 level=INFO source=server.go:559 msg="waiting for llama runner to start responding" time=2024-12-22T18:11:59.495-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server error" time=2024-12-22T18:11:59.558-05:00 level=INFO source=runner.go:939 msg="starting go runner" time=2024-12-22T18:11:59.559-05:00 level=INFO source=runner.go:940 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(clang)" threads=8 time=2024-12-22T18:11:59.560-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63918" llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3 llama_model_loader: - kv 3: general.version str = v1.3 llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax llama_model_loader: - kv 5: general.basename str = Llama-3.1 llama_model_loader: - kv 6: general.size_label str = 70B llama_model_loader: - kv 7: llama.block_count u32 = 80 llama_model_loader: - kv 8: llama.context_length u32 = 131072 llama_model_loader: - kv 9: llama.embedding_length u32 = 8192 llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 15: llama.attention.key_length u32 = 128 llama_model_loader: - kv 16: llama.attention.value_length u32 = 128 llama_model_loader: - kv 17: general.file_type u32 = 15 llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 28: general.quantization_version u32 = 2 llama_model_loader: - type f32: 162 tensors llama_model_loader: - type q4_K: 441 tensors llama_model_loader: - type q5_K: 40 tensors llama_model_loader: - type q6_K: 81 tensors time=2024-12-22T18:11:59.756-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 8192 llm_load_print_meta: n_layer = 80 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 28672 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 70B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 70.55 B llm_load_print_meta: model size = 39.59 GiB (4.82 BPW) llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOM token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes llm_load_tensors: ggml ctx size = 1.02 MiB llm_load_tensors: offloading 4 repeating layers to GPU llm_load_tensors: offloaded 4/81 layers to GPU llm_load_tensors: CUDA_Host buffer size = 38467.63 MiB llm_load_tensors: CUDA0 buffer size = 1037.75 MiB llm_load_tensors: CUDA1 buffer size = 1037.75 MiB time=2024-12-22T18:12:12.655-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.01" time=2024-12-22T18:12:12.915-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.06" time=2024-12-22T18:12:13.178-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.10" 
time=2024-12-22T18:12:13.430-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.14" time=2024-12-22T18:12:13.694-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.18" time=2024-12-22T18:12:13.957-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.21" time=2024-12-22T18:12:14.222-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.25" time=2024-12-22T18:12:14.487-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.29" time=2024-12-22T18:12:14.751-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.33" time=2024-12-22T18:12:15.017-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.37" time=2024-12-22T18:12:15.280-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.40" time=2024-12-22T18:12:15.546-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.45" time=2024-12-22T18:12:15.810-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.48" time=2024-12-22T18:12:16.075-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.52" time=2024-12-22T18:12:16.336-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.56" time=2024-12-22T18:12:16.599-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.60" time=2024-12-22T18:12:16.861-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.63" time=2024-12-22T18:12:17.128-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.67" time=2024-12-22T18:12:17.391-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.71" time=2024-12-22T18:12:17.655-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.74" time=2024-12-22T18:12:17.920-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.78" time=2024-12-22T18:12:18.183-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.82" time=2024-12-22T18:12:18.434-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.86" time=2024-12-22T18:12:18.699-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.90" time=2024-12-22T18:12:18.961-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.95" time=2024-12-22T18:12:19.225-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.98" time=2024-12-22T18:12:19.494-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.99" llama_new_context_with_model: n_ctx = 131072 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 ggml_cuda_host_malloc: failed to allocate 38912.00 MiB of pinned memory: out of memory time=2024-12-22T18:12:19.755-05:00 level=DEBUG source=server.go:604 msg="model load progress 1.00" time=2024-12-22T18:12:20.019-05:00 level=DEBUG source=server.go:607 msg="model load completed, waiting for server to become available" status="llm server loading model" llama_kv_cache_init: CPU KV buffer size = 38912.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 1024.00 MiB llama_new_context_with_model: KV self size = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB llama_new_context_with_model: CUDA1 compute buffer size = 272.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 272.01 
MiB llama_new_context_with_model: graph nodes = 2247 llama_new_context_with_model: graph splits = 993 time=2024-12-22T18:12:27.755-05:00 level=INFO source=server.go:598 msg="llama runner started in 28.26 seconds" time=2024-12-22T18:12:27.755-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:12:27.755-05:00 level=DEBUG source=routes.go:1466 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. 
He then walked away from us, leaving us alone.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" time=2024-12-22T18:12:27.757-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=409 used=0 remaining=409 time=2024-12-22T18:12:40.762-05:00 level=DEBUG source=sched.go:466 msg="context for request finished" time=2024-12-22T18:12:40.762-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 duration=1h0m0s time=2024-12-22T18:12:40.762-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0 [GIN] 2024/12/22 - 18:12:40 | 200 | 41.9626905s | 127.0.0.1 | POST "/api/chat" time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0 time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:06.093-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.4 GiB" before.free_swap="225.8 GiB" now.total="191.7 GiB" now.free="85.0 GiB" now.free_swap="145.6 GiB" time=2024-12-22T18:13:06.118-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="18.6 GiB" now.used="5.4 GiB" time=2024-12-22T18:13:06.133-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="19.7 GiB" now.used="3.2 GiB" releasing nvml library time=2024-12-22T18:13:06.134-05:00 level=DEBUG source=server.go:1075 msg="stopping llama server" time=2024-12-22T18:13:06.134-05:00 level=DEBUG source=server.go:1081 msg="waiting for llama server to exit" time=2024-12-22T18:13:06.398-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="85.0 GiB" before.free_swap="145.6 GiB" now.total="191.7 GiB" now.free="85.1 GiB" now.free_swap="148.9 GiB" time=2024-12-22T18:13:06.707-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="18.6 
GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:06.723-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="19.7 GiB" now.total="24.0 GiB" now.free="22.4 GiB" now.used="563.6 MiB" releasing nvml library time=2024-12-22T18:13:06.724-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="85.1 GiB" before.free_swap="148.9 GiB" now.total="191.7 GiB" now.free="85.1 GiB" now.free_swap="151.7 GiB" time=2024-12-22T18:13:07.439-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:07.454-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.4 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:07.454-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="85.1 GiB" before.free_swap="151.7 GiB" now.total="191.7 GiB" now.free="86.0 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:07.469-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:07.485-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:07.640-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="86.0 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="88.7 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:07.655-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:07.671-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:07.892-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="88.7 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="92.4 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:07.907-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:07.922-05:00 
level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:08.140-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="92.4 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="95.7 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:08.156-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:08.171-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:08.388-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="95.7 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="99.0 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:08.403-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:08.419-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:08.638-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="99.0 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="102.3 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:08.653-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:08.668-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:08.887-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="102.3 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="105.5 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:08.902-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:08.918-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" 
gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:09.136-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="105.5 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="108.8 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:09.151-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:09.167-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:09.385-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="108.8 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="112.7 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:09.401-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:09.418-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:09.650-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="112.7 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="116.8 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:09.666-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:09.682-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:09.886-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="116.8 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="120.0 GiB" now.free_swap="152.4 GiB" time=2024-12-22T18:13:09.900-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:09.916-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" 
before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:10.135-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="120.0 GiB" before.free_swap="152.4 GiB" now.total="191.7 GiB" now.free="123.6 GiB" now.free_swap="190.6 GiB" time=2024-12-22T18:13:10.151-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:10.166-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:10.399-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="123.6 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="129.0 GiB" now.free_swap="190.6 GiB" time=2024-12-22T18:13:10.415-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:10.431-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:10.650-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="129.0 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="133.7 GiB" now.free_swap="190.6 GiB" time=2024-12-22T18:13:10.666-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:10.681-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:10.899-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="133.7 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="137.9 GiB" now.free_swap="190.6 GiB" time=2024-12-22T18:13:10.914-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:10.929-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" 
now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:11.135-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0418316 model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:11.135-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="137.9 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="142.5 GiB" now.free_swap="190.6 GiB" time=2024-12-22T18:13:11.168-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:11.181-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:11.399-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3064613 model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=server.go:1085 msg="llama server stopped" time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:12.236-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="142.5 GiB" before.free_swap="190.6 GiB" now.total="191.7 GiB" now.free="161.9 GiB" now.free_swap="228.9 GiB" time=2024-12-22T18:13:12.259-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:12.274-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:12.274-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=6.1810826 model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:12.275-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.9 GiB" before.free_swap="228.9 GiB" now.total="191.7 GiB" now.free="161.9 GiB" now.free_swap="228.9 GiB" time=2024-12-22T18:13:12.289-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" 
gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:12.304-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:12.345-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:12.345-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T18:13:12.345-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T18:13:12.346-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2024-12-22T18:13:12.346-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.1 GiB]" time=2024-12-22T18:13:12.348-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T18:13:12.349-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T18:13:12.349-05:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 library=cuda parallel=1 required="39.5 GiB" time=2024-12-22T18:13:12.349-05:00 level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="191.7 GiB" before.free="161.9 GiB" before.free_swap="228.9 GiB" now.total="191.7 GiB" now.free="161.9 GiB" now.free_swap="228.9 GiB" time=2024-12-22T18:13:12.458-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.1 GiB" now.total="24.0 GiB" now.free="22.1 GiB" now.used="1.9 GiB" time=2024-12-22T18:13:12.473-05:00 level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="481.7 MiB" releasing nvml library time=2024-12-22T18:13:12.474-05:00 level=INFO source=server.go:105 msg="system memory" total="191.7 GiB" free="161.9 GiB" free_swap="228.9 GiB" time=2024-12-22T18:13:12.474-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 22.1 GiB]" time=2024-12-22T18:13:12.475-05:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=48 layers.model=81 layers.offload=48 layers.split=24,24 memory.available="[22.5 GiB 22.1 GiB]" memory.gpu_overhead="1.5 GiB" memory.required.full="59.7 GiB" memory.required.partial="39.5 GiB" memory.required.kv="10.0 GiB" memory.required.allocations="[19.7 GiB 19.7 GiB]" memory.weights.total="48.2 GiB" memory.weights.repeating="47.4 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="4.3 GiB" memory.graph.partial="4.3 GiB" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 
msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe" time=2024-12-22T18:13:12.476-05:00 level=DEBUG source=common.go:294 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm\\ollama_llama_server.exe" time=2024-12-22T18:13:12.481-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model E:\\DeepLearning\\LLM\\blobs\\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 --ctx-size 32768 --batch-size 512 --n-gpu-layers 48 --verbose --threads 8 --flash-attn --no-mmap --parallel 1 --tensor-split 24,24 --port 64176" time=2024-12-22T18:13:12.481-05:00 level=DEBUG source=server.go:397 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_3=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3 CUDA_PATH_V12_6=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_VISIBLE_DEVICES=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8,GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 
PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files (x86)\\Incredibuild;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Program Files\\Git\\cmd;C:\\Program Files\\PuTTY\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2024.3.2\\;C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler;C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\User\\.dotnet\\tools;C:\\Users\\User\\miniconda3;C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin;C:\\Users\\User\\miniconda3\\Library\\usr\\bin;C:\\Users\\User\\miniconda3\\Library\\bin;C:\\Users\\User\\miniconda3\\Scripts;C:\\Users\\User\\AppData\\Roaming\\npm;C:\\Program Files\\7-Zip;C:\\ffmpeg\\bin;;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Users\\User\\.cache\\lm-studio\\bin;C:\\Users\\User\\go\\bin;C:\\Users\\User\\.dotnet\\tools]" time=2024-12-22T18:13:12.489-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2024-12-22T18:13:12.489-05:00 level=INFO source=server.go:559 msg="waiting for llama runner to start responding" time=2024-12-22T18:13:12.492-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server error" time=2024-12-22T18:13:12.585-05:00 level=INFO source=runner.go:939 msg="starting go runner" time=2024-12-22T18:13:12.585-05:00 level=INFO source=runner.go:940 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(clang)" threads=8 time=2024-12-22T18:13:12.586-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:64176" llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3 llama_model_loader: - kv 3: general.version str = v1.3 llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax llama_model_loader: - kv 5: general.basename str = Llama-3.1 llama_model_loader: - kv 6: general.size_label str = 70B llama_model_loader: - kv 7: llama.block_count u32 = 80 llama_model_loader: - kv 8: llama.context_length u32 = 131072 llama_model_loader: - kv 9: llama.embedding_length u32 = 8192 llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 15: llama.attention.key_length u32 = 128 llama_model_loader: - kv 16: llama.attention.value_length u32 = 128 llama_model_loader: - kv 17: general.file_type u32 = 15 llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 28: general.quantization_version u32 = 2 llama_model_loader: - type f32: 162 tensors llama_model_loader: - type q4_K: 441 tensors llama_model_loader: - type q5_K: 40 tensors llama_model_loader: - type q6_K: 81 tensors time=2024-12-22T18:13:12.757-05:00 level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 8192 llm_load_print_meta: n_layer = 80 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 28672 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 70B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 70.55 B llm_load_print_meta: model size = 39.59 GiB (4.82 BPW) llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOM token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes llm_load_tensors: ggml ctx size = 1.02 MiB llm_load_tensors: offloading 48 repeating layers to GPU llm_load_tensors: offloaded 48/81 layers to GPU llm_load_tensors: CUDA_Host buffer size = 17107.43 MiB llm_load_tensors: CUDA0 buffer size = 11512.01 MiB llm_load_tensors: CUDA1 buffer size = 11923.69 MiB time=2024-12-22T18:13:17.388-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.04" time=2024-12-22T18:13:17.651-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.06" time=2024-12-22T18:13:17.902-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.09" 
time=2024-12-22T18:13:18.164-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.13" time=2024-12-22T18:13:18.414-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.16" time=2024-12-22T18:13:18.680-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.20" time=2024-12-22T18:13:18.943-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.24" time=2024-12-22T18:13:19.208-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.28" time=2024-12-22T18:13:19.471-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.31" time=2024-12-22T18:13:19.734-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.35" time=2024-12-22T18:13:19.999-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.39" time=2024-12-22T18:13:20.260-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.42" time=2024-12-22T18:13:20.525-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.45" time=2024-12-22T18:13:20.791-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.48" time=2024-12-22T18:13:21.054-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.51" time=2024-12-22T18:13:21.318-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.54" time=2024-12-22T18:13:21.582-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.57" time=2024-12-22T18:13:21.832-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.60" time=2024-12-22T18:13:22.095-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.64" time=2024-12-22T18:13:22.359-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.67" time=2024-12-22T18:13:22.624-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.70" time=2024-12-22T18:13:22.888-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.72" time=2024-12-22T18:13:23.153-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.75" time=2024-12-22T18:13:23.417-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.78" time=2024-12-22T18:13:23.680-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.81" time=2024-12-22T18:13:23.942-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.84" time=2024-12-22T18:13:24.203-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.88" time=2024-12-22T18:13:24.468-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.92" time=2024-12-22T18:13:24.732-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.95" time=2024-12-22T18:13:24.994-05:00 level=DEBUG source=server.go:604 msg="model load progress 0.99" llama_new_context_with_model: n_ctx = 32768 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 time=2024-12-22T18:13:25.258-05:00 level=DEBUG source=server.go:604 msg="model load progress 1.00" time=2024-12-22T18:13:25.509-05:00 level=DEBUG source=server.go:607 msg="model load completed, waiting for server to become available" status="llm server loading model" llama_kv_cache_init: CUDA_Host KV buffer size = 4096.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 3072.00 MiB llama_new_context_with_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.52 
MiB llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB llama_new_context_with_model: CUDA1 compute buffer size = 176.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 80.01 MiB llama_new_context_with_model: graph nodes = 2247 llama_new_context_with_model: graph splits = 421 time=2024-12-22T18:13:26.563-05:00 level=INFO source=server.go:598 msg="llama runner started in 14.07 seconds" time=2024-12-22T18:13:26.563-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 time=2024-12-22T18:13:26.564-05:00 level=DEBUG source=server.go:962 msg="new runner detected, loading model for cgo tokenization" llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.1 70B ArliAI RPMax v1.3 llama_model_loader: - kv 3: general.version str = v1.3 llama_model_loader: - kv 4: general.finetune str = ArliAI-RPMax llama_model_loader: - kv 5: general.basename str = Llama-3.1 llama_model_loader: - kv 6: general.size_label str = 70B llama_model_loader: - kv 7: llama.block_count u32 = 80 llama_model_loader: - kv 8: llama.context_length u32 = 131072 llama_model_loader: - kv 9: llama.embedding_length u32 = 8192 llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 15: llama.attention.key_length u32 = 128 llama_model_loader: - kv 16: llama.attention.value_length u32 = 128 llama_model_loader: - kv 17: general.file_type u32 = 15 llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 28: general.quantization_version u32 = 2 llama_model_loader: - type f32: 162 tensors llama_model_loader: - type q4_K: 441 tensors llama_model_loader: - type q5_K: 40 tensors llama_model_loader: - type q6_K: 81 tensors llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 1 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = all F32 llm_load_print_meta: model params = 70.55 B llm_load_print_meta: model size = 39.59 GiB (4.82 BPW) llm_load_print_meta: general.name = Llama 3.1 70B ArliAI RPMax v1.3 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOM token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 llama_model_load: vocab only - skipping tensors time=2024-12-22T18:13:26.937-05:00 level=DEBUG source=routes.go:1466 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nSarah explained that there had been a<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nInstruction: Describe the following scene in detail, add many light hearted conversation indicating closeness. Both Stephanie and her mom, Sarah, are very beautiful and sexy women. 
They and I were at their house, came straight from the church. Sarah told me that they had some leak in the roof near bedroom and asked me to help. Don't describe the background too much. Describe that their house is a 3-bedroom, 2 bathroom house, not too big, not too small. The roof is quite simple. Make it about 120 sentences\nScene: Open the scene with Sarah's more detailed description on the leak. Stephanie just came from the shed carrying the ladder and several roof shingles and a toolbox. I thanked Stephanie for the ladder and propped it. I took of my shirt, bare chested. I could tell that both Sarah and Stephanie admire my body. I took the shingles and the toolbox up the roof. I managed to replace a few obvious broken shingles, but a section may need a more thorough treatment. So, I climbed down and told them the situation. I was drenched in sweat. Sarah was worried about the cost. I assured her that I would cover all the cost for her. She was overjoyed. Stephanie hugged me tight and kissed me in the lips. Sarah also hugged me tight and kissed me on the cheeks. Because I was sweaty, that made their clothes wet a bit. Just at that moment, Sarah's estranged husband, Mario, came, reeked in alcohol. He was furious. He thought that I was a roofer who tried to take advantage of his wife and daughter. We tried to explain the situation but he was still angry. I wear my shirt again. Sarah told him that I offered her to pay for the roof repair, but he didn't care about the roof. He told to bugger off and don't come back. He then walked away from us, leaving us alone.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" time=2024-12-22T18:13:26.938-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=825 used=0 remaining=825 time=2024-12-22T18:13:37.417-05:00 level=DEBUG source=sched.go:466 msg="context for request finished" time=2024-12-22T18:13:37.417-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 duration=1h0m0s time=2024-12-22T18:13:37.417-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-77eea14c7e91f0a31ea3fe0c6a9674699d6762f7d1edaee846c13c39bf41fb42 refCount=0 [GIN] 2024/12/22 - 18:13:37 | 200 | 31.3430724s | 127.0.0.1 | POST "/api/chat"

@YonTracks commented on GitHub (Dec 23, 2024):

Great progress, cheers. What have we learned / confirmed? We should make a list.

Here is what I think, but want to be sure:

  1. When using multiple GPUs, the params are only being set on the main GPU, and the others seem to switch back to defaults (0.4.7) for the KV context and grammar updates.
     Edit: according to the logs, 0.4.7 is also switching back to the default ctx on the second GPU; check https://github.com/ollama/ollama/commit/2cd11ae365a9423578069457312dce6b9e1e5a37. A verification sketch follows this list.

  2. The default num_ctx can be too small in some cases. Was this affected in earlier versions? I wonder whether a slightly larger default would help or hinder overall.

  3. Does this happen on Windows only, or on Linux etc. as well?
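
One way to pin down point 1 without a dev build: send a request with an explicit num_ctx through the API, then compare it against what the loader reports. A minimal sketch using the standard Ollama REST API (the model tag is just the one from this thread):

```
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "hf.co/ArliAI/Llama-3.1-70B-ArliAI-RPMax-v1.3-GGUF:Q4_K_M",
  "prompt": "hello",
  "options": { "num_ctx": 32768 }
}'
```

Then check server.log for the `llama_new_context_with_model: n_ctx = ...` line. If the requested value shows up there but output is only garbled in the multi-GPU case, the divergence is happening below the context-setup layer.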

> Thank you for the insight. I tried version 0.3.14 and set the context (num_ctx) to 131072 and IT WORKED!!!!! THANK YOU SO MUCH!!!
>
> I tested the following model:
>
> `ollama.exe run hf.co/ArliAI/Llama-3.1-70B-ArliAI-RPMax-v1.3-GGUF:Q4_K_M`
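
For anyone reproducing this workaround: num_ctx can be set per session inside the REPL or baked into a model with a Modelfile (both are stock Ollama features; the `rpmax-131k` name below is arbitrary):

```
# inside `ollama run <model>`:
/set parameter num_ctx 131072

# or bake it in with a Modelfile containing:
#   FROM hf.co/ArliAI/Llama-3.1-70B-ArliAI-RPMax-v1.3-GGUF:Q4_K_M
#   PARAMETER num_ctx 131072
ollama create rpmax-131k -f Modelfile
ollama run rpmax-131k
```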


@robbyjo commented on GitHub (Dec 23, 2024):

@YonTracks Thanks.

Not sure about your first question, but for your second: for Llama 3.1, a 32K context should usually be enough. As for your third, I can only confirm Windows; I cannot confirm Linux since I only have Win 11 machines.

For 0.4.x, I noticed that the GPU VRAM utilization was quite low (~6 GB for GPU 0 and ~5 GB for GPU 1) and it was a lot slower than single GPU (GPU 1) on v0.5.4.


@YonTracks commented on GitHub (Dec 23, 2024):

> @YonTracks Thanks.
>
> Not sure about your first question, but for your second: for Llama 3.1, a 32K context should usually be enough. As for your third, I can only confirm Windows; I cannot confirm Linux since I only have Win 11 machines.
>
> For 0.4.x, I noticed that the GPU VRAM utilization was quite low (~6 GB for GPU 0 and ~5 GB for GPU 1) and it was a lot slower than single GPU (GPU 1) on v0.5.4.

We need to confirm whether the params are actually being used on the other GPUs, and if not, why.


@YonTracks commented on GitHub (Dec 23, 2024):

[hardcoded-ctx.txt](https://github.com/user-attachments/files/18223824/hardcoded-ctx.txt)
I think we can confirm this with a hardcoded num_ctx: if the num_ctx still doesn't get used on the other GPUs... there's your problem, lol.

```
params := []string{
	"--model", model,
	"--ctx-size", strconv.Itoa(32768),
	"--batch-size", strconv.Itoa(opts.NumBatch),
}
```

@robbyjo commented on GitHub (Dec 23, 2024):

I'd be more than happy to test this, but how?


@YonTracks commented on GitHub (Dec 23, 2024):

> I'd be more than happy to test this, but how?

You'd need a local dev build and multiple GPUs. I think the pros have enough info now?
Super cheers, I will leave it to those fine folks, unless you're keen and have a dev build.


@robbyjo commented on GitHub (Dec 23, 2024):

I currently do not have a dev build. I have MSVC 2022, MSYS2, and Go 1.23 installed, but I'm unsure how to proceed.


@YonTracks commented on GitHub (Dec 23, 2024):

> What have we learned / confirmed? We should make a list.
>
> I think, but want to be sure:
>
> 1. When using multiple GPUs, the params seem to be set only on the main GPU, and the others switch to defaults (KV context and grammar updates), even on 0.4.7.
>    Edit: according to the logs, 0.4.7 is also switching back to the default ctx on the second GPU.
>    Check the following commits:
>    2cd11ae365
>    and
>    3478b2cf14

I bet Jesse will know; should we ping? It is the holidays; Rick will know?

> 2. The default num_ctx can be too small in some cases. Was this affected in earlier versions? And I wonder whether a slightly larger default would help or hinder overall.
> 3. Is this Windows only, or Linux too?
>
> I think we can confirm this with a hardcoded num_ctx: if the num_ctx still doesn't get used on the other GPUs... there's your problem, lol.

> I currently do not have a dev build. I have MSVC 2022, MSYS2, and Go 1.23 installed, but I'm unsure how to proceed.

Yep, the Windows dev build was tricky for me; even when I thought I had it compiled all correct, Ollama would work, but with issues.
It wasn't until I could 'make' via the .iss script and build a setup exe that I could confirm the dev build was correct, so safest to just not?

But if you're keen on learning, like me (as long as we don't hinder, lol):
https://github.com/ollama/ollama/blob/main/docs/development.md

Good luck, and super cheers for your help.


@rick-github commented on GitHub (Dec 23, 2024):

There is no per-GPU context window. The context window is set for the runner via opts.NumCtx, and hardcoding it in server.go is exactly the same as "options":{"num_ctx":32768}. The change in ctx-size in the logs is due to the model being loaded with the default context size for one API call and then being reloaded with a different context size when the API specifies num_ctx. The KV buffer allocated on each GPU will vary by the number of layers assigned to the GPU, but the total of all KV buffers will always sum to the proportional value of ctx-size given to the runner. Reloading the model when the context size changes is a problem in some circumstances that could be alleviated with https://github.com/ollama/ollama/pull/8029.
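
To make rick-github's point concrete, here is a minimal sketch (not from the thread) of the per-request route he describes: passing "options":{"num_ctx":32768} through the API, which is what hardcoding --ctx-size in server.go reproduces. The /api/generate endpoint and the num_ctx option are the standard Ollama API; the model tag is just reused from the test earlier in the thread.

```
// Minimal sketch: request a specific context size per API call.
// Per rick-github, this is equivalent to hardcoding --ctx-size in server.go.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model":  "hf.co/ArliAI/Llama-3.1-70B-ArliAI-RPMax-v1.3-GGUF:Q4_K_M",
		"prompt": "Say hello.",
		"stream": false,
		// Without this, the runner loads with the default context
		// (the n_ctx = 2048 seen in the logs below); with it, the
		// model is reloaded at the requested size.
		"options": map[string]any{"num_ctx": 32768},
	})
	resp, err := http.Post("http://127.0.0.1:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```

Given the layers.split=35,36 seen in the logs below, the per-GPU KV buffers would then be roughly 35/71 and 36/71 of the total, while still summing to the full requested ctx-size.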


@YonTracks commented on GitHub (Dec 23, 2024):

> There is no per-GPU context window. The context window is set for the runner via opts.NumCtx, and hardcoding it in server.go is exactly the same as "options":{"num_ctx":32768}. The change in ctx-size in the logs is due to the model being loaded with the default context size for one API call and then being reloaded with a different context size when the API specifies num_ctx. The KV buffer allocated on each GPU will vary by the number of layers assigned to the GPU, but the total of all KV buffers will always sum to the proportional value of ctx-size given to the runner. Reloading the model when the context size changes is a problem in some circumstances that could be alleviated with #8029.

Yes, perfect, thank you! That's what I needed to know; I will investigate.
Exactly why (the reloading bit), with multiple GPUs, does num_ctx start out correctly at the set value but then switch back to the default? And when this happens, the multi-GPU output gets corrupted, the same as a single GPU with a low ctx for these particular 70B models.
And is it only Windows?

I see in the logs that it says llama_new_context_with_model: n_ctx = 2048 when the issue is happening, and 2048 is the default, even with multiple GPUs, so we can confirm the issue is in how that all happens.

Expected behavior:
The multi-GPU logs show the same num_ctx as the main one, either by using the same number or splitting it in half or whatever, but either way the GPUs are happy.


@YonTracks commented on GitHub (Dec 23, 2024):

> There is no per-GPU context window. The context window is set for the runner via opts.NumCtx, and hardcoding it in server.go is exactly the same as "options":{"num_ctx":32768}. The change in ctx-size in the logs is due to the model being loaded with the default context size for one API call and then being reloaded with a different context size when the API specifies num_ctx. The KV buffer allocated on each GPU will vary by the number of layers assigned to the GPU, but the total of all KV buffers will always sum to the proportional value of ctx-size given to the runner. Reloading the model when the context size changes is a problem in some circumstances that could be alleviated with #8029.

Awesome! I think that's it, or very closely related. Thank you so much for your hard work and dedication!


@robbyjo commented on GitHub (Jan 3, 2025):

FYI, Kobold 1.80.3 seems to have the same issue. So perhaps this is a llama.cpp issue.


@rick-github commented on GitHub (Jan 3, 2025):

It wasn't the switch to the Go runners, and the only big change from 0.4.7 to 0.5.0 was the K/V quantization, which makes use of parts of llama.cpp that weren't used before. It would be helpful to retry the tests with OLLAMA_FLASH_ATTENTION=0. The initial post said that FA on/off had been tried, but all of the logs have it on.
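
For anyone rerunning this test, here is a minimal sketch (not from the thread) of relaunching the server with flash attention disabled while keeping debug logging, so the resulting runner logs can be compared with the FA-on runs. It assumes ollama is on PATH; the env var names match the server config dump in the logs.

```
// Minimal sketch: start `ollama serve` with flash attention off.
package main

import (
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("ollama", "serve")
	// Inherit the current environment, overriding the knobs under test.
	cmd.Env = append(os.Environ(),
		"OLLAMA_FLASH_ATTENTION=0", // the setting rick-github asked to retry
		"OLLAMA_DEBUG=1",           // keep DEBUG-level server.log output
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

On Windows, the same can be done by setting the variables in the environment before starting the Ollama app; the config dump at the top of the resulting server.log confirms the values that took effect, as seen in the reply below.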


@robbyjo commented on GitHub (Jan 3, 2025):

I turned off OLLAMA_FLASH_ATTENTION. Its output is garbled. Here is a sample of it:

1&'D!84E,C#++$>.!1";2E$'>?6E=H<49>AGE>64?%

Here is the debug output of server.log:

2025/01/03 08:49:45 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\DeepLearning\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-01-03T08:49:45.460-05:00 level=INFO source=images.go:757 msg="total blobs: 79"
time=2025-01-03T08:49:45.462-05:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2025-01-03T08:49:45.463-05:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:80 msg="runners located" dir="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:45.464-05:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]"
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=routes.go:1340 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2025-01-03T08:49:45.464-05:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-03T08:49:45.464-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-01-03T08:49:45.464-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-01-03T08:49:45.464-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=gpu.go:99 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvml.dll
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvml.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvml.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvml.dll C:\Windows\system32\nvml.dll C:\Windows\nvml.dll C:\Windows\System32\Wbem\nvml.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvml.dll C:\Windows\System32\OpenSSH\nvml.dll C:\Program Files\dotnet\nvml.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvml.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvml.dll C:\Program Files (x86)\Incredibuild\nvml.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvml.dll C:\Program Files\nodejs\nvml.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvml.dll C:\Program Files\Git\cmd\nvml.dll C:\Program Files\PuTTY\nvml.dll C:\Program Files\Docker\Docker\resources\bin\nvml.dll C:\Program Files\Go\bin\nvml.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2\nvml.dll C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\Hostx64\x64\nvml.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvml.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvml.dll C:\Users\User\.dotnet\tools\nvml.dll C:\Users\User\miniconda3\nvml.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvml.dll C:\Users\User\miniconda3\Library\usr\bin\nvml.dll C:\Users\User\miniconda3\Library\bin\nvml.dll C:\Users\User\miniconda3\Scripts\nvml.dll C:\Users\User\AppData\Roaming\npm\nvml.dll C:\Program Files\7-Zip\nvml.dll C:\ffmpeg\bin\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\AppData\Local\Programs\Ollama\nvml.dll C:\Users\User\.cache\lm-studio\bin\nvml.dll C:\Users\User\go\bin\nvml.dll C:\Users\User\.dotnet\tools\nvml.dll c:\Windows\System32\nvml.dll]"
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvml.dll"
time=2025-01-03T08:49:45.465-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths="[C:\Windows\system32\nvml.dll c:\Windows\System32\nvml.dll]"
time=2025-01-03T08:49:45.479-05:00 level=DEBUG source=gpu.go:120 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-01-03T08:49:45.479-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvcuda.dll
time=2025-01-03T08:49:45.479-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp\nvcuda.dll C:\Program Files\Common Files\Oracle\Java\javapath\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcuda.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp\nvcuda.dll C:\Windows\system32\nvcuda.dll C:\Windows\nvcuda.dll C:\Windows\System32\Wbem\nvcuda.dll C:\Windows\System32\WindowsPowerShell\v1.0\nvcuda.dll C:\Windows\System32\OpenSSH\nvcuda.dll C:\Program Files\dotnet\nvcuda.dll C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll C:\Program Files\Microsoft SQL Server\150\Tools\Binn\nvcuda.dll C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\nvcuda.dll C:\Program Files (x86)\Incredibuild\nvcuda.dll C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\nvcuda.dll C:\Program Files\nodejs\nvcuda.dll C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR\nvcuda.dll C:\Program Files\Git\cmd\nvcuda.dll C:\Program Files\PuTTY\nvcuda.dll C:\Program Files\Docker\Docker\resources\bin\nvcuda.dll C:\Program Files\Go\bin\nvcuda.dll C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2\nvcuda.dll C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\Hostx64\x64\nvcuda.dll C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler\nvcuda.dll C:\Users\User\AppData\Local\Microsoft\WindowsApps\nvcuda.dll C:\Users\User\.dotnet\tools\nvcuda.dll C:\Users\User\miniconda3\nvcuda.dll C:\Users\User\miniconda3\Library\mingw-w64\bin\nvcuda.dll C:\Users\User\miniconda3\Library\usr\bin\nvcuda.dll C:\Users\User\miniconda3\Library\bin\nvcuda.dll C:\Users\User\miniconda3\Scripts\nvcuda.dll C:\Users\User\AppData\Roaming\npm\nvcuda.dll C:\Program Files\7-Zip\nvcuda.dll C:\ffmpeg\bin\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\AppData\Local\Programs\Ollama\nvcuda.dll C:\Users\User\.cache\lm-studio\bin\nvcuda.dll C:\Users\User\go\bin\nvcuda.dll C:\Users\User\.dotnet\tools\nvcuda.dll c:\windows\system\nvcuda.dll]"
time=2025-01-03T08:49:45.479-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\nvcuda.dll"
time=2025-01-03T08:49:45.480-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFD00D34D20
dlsym: cuDriverGetVersion - 00007FFD00D34DC0
dlsym: cuDeviceGetCount - 00007FFD00D355B6
dlsym: cuDeviceGet - 00007FFD00D355B0
dlsym: cuDeviceGetAttribute - 00007FFD00D34F10
dlsym: cuDeviceGetUuid - 00007FFD00D355C2
dlsym: cuDeviceGetName - 00007FFD00D355BC
dlsym: cuCtxCreate_v3 - 00007FFD00D35634
dlsym: cuMemGetInfo_v2 - 00007FFD00D35736
dlsym: cuCtxDestroy - 00007FFD00D35646
calling cuInit
calling cuDriverGetVersion
raw version 0x2f26
CUDA driver version: 12.7
calling cuDeviceGetCount
device count 2
time=2025-01-03T08:49:45.500-05:00 level=DEBUG source=gpu.go:134 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA totalMem 24563 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA freeMem 22994 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] Compute Capability 8.9
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA totalMem 24563 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA freeMem 22994 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] Compute Capability 8.9
time=2025-01-03T08:49:45.718-05:00 level=INFO source=gpu.go:334 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2025-01-03T08:49:45.719-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2025-01-03T08:49:45.720-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2025-01-03T08:49:45.720-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2025/01/03 - 08:49:58 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/03 - 08:49:58 | 200 | 22.0758ms | 127.0.0.1 | POST "/api/show"
time=2025-01-03T08:49:58.273-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="154.9 GiB" before.free_swap="159.5 GiB" now.total="191.7 GiB" now.free="154.9 GiB" now.free_swap="159.5 GiB"
time=2025-01-03T08:49:58.289-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB"
time=2025-01-03T08:49:58.305-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB"
releasing nvml library
time=2025-01-03T08:49:58.308-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7ff73eed4620 gpu_count=2
time=2025-01-03T08:49:58.343-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:49:58.343-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2025-01-03T08:49:58.344-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2025-01-03T08:49:58.344-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2025-01-03T08:49:58.345-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2025-01-03T08:49:58.346-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.8 GiB]"
time=2025-01-03T08:49:58.347-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.8 GiB]"
time=2025-01-03T08:49:58.348-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="154.9 GiB" before.free_swap="159.5 GiB" now.total="191.7 GiB" now.free="154.9 GiB" now.free_swap="159.4 GiB"
time=2025-01-03T08:49:58.367-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="21.8 GiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB"
time=2025-01-03T08:49:58.382-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB"
releasing nvml library
time=2025-01-03T08:49:58.384-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="154.9 GiB" free_swap="159.4 GiB"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[21.8 GiB 22.5 GiB]"
time=2025-01-03T08:49:58.384-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=71 layers.split=35,36 memory.available="[21.8 GiB 22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="49.4 GiB" memory.required.partial="43.6 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[21.5 GiB 22.1 GiB]" memory.weights.total="44.5 GiB" memory.weights.repeating="43.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2025-01-03T08:49:58.389-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba --ctx-size 2048 --batch-size 512 --n-gpu-layers 71 --verbose --threads 8 --no-mmap --parallel 1 --tensor-split 35,36 --port 50946"
time=2025-01-03T08:49:58.389-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_PATH_V12_6=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\Go\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2\;C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\Hostx64\x64;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin;C:\Users\User\go\bin;C:\Users\User\.dotnet\tools]"
time=2025-01-03T08:49:58.392-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-03T08:49:58.392-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-03T08:49:58.393-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-03T08:49:58.466-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2025-01-03T08:49:58.559-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2025-01-03T08:49:58.560-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:50946"
time=2025-01-03T08:49:58.644-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 16
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: quantize.imatrix.file str = /models_out/Llama-3.3-70B-Instruct-ab...
llama_model_loader: - kv 37: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 561 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
llm_load_vocab: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
llm_load_vocab: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
llm_load_vocab: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
llm_load_vocab: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
llm_load_vocab: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
llm_load_vocab: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
llm_load_vocab: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
llm_load_vocab: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
llm_load_vocab: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
llm_load_vocab: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
llm_load_vocab: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
llm_load_vocab: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
llm_load_vocab: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
llm_load_vocab: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
llm_load_vocab: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
llm_load_vocab: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
llm_load_vocab: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
llm_load_vocab: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
llm_load_vocab: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
llm_load_vocab: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
llm_load_vocab: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
llm_load_vocab: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
llm_load_vocab: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
llm_load_vocab: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
llm_load_vocab: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
llm_load_vocab: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
llm_load_vocab: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
llm_load_vocab: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
llm_load_vocab: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
llm_load_vocab: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
llm_load_vocab: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
llm_load_vocab: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
llm_load_vocab: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
llm_load_vocab: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
llm_load_vocab: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
llm_load_vocab: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
llm_load_vocab: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
llm_load_vocab: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
llm_load_vocab: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
llm_load_vocab: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
llm_load_vocab: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
llm_load_vocab: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
llm_load_vocab: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
llm_load_vocab: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
llm_load_vocab: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
llm_load_vocab: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
llm_load_vocab: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
llm_load_vocab: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
llm_load_vocab: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
llm_load_vocab: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
llm_load_vocab: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
llm_load_vocab: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
llm_load_vocab: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
llm_load_vocab: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
llm_load_vocab: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
llm_load_vocab: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
llm_load_vocab: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
llm_load_vocab: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
llm_load_vocab: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
llm_load_vocab: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
llm_load_vocab: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
llm_load_vocab: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
llm_load_vocab: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
llm_load_vocab: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
llm_load_vocab: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
llm_load_vocab: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
llm_load_vocab: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
llm_load_vocab: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
llm_load_vocab: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
llm_load_vocab: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
llm_load_vocab: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
llm_load_vocab: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
llm_load_vocab: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
llm_load_vocab: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
llm_load_vocab: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
llm_load_vocab: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
llm_load_vocab: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
llm_load_vocab: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
llm_load_vocab: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
llm_load_vocab: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
llm_load_vocab: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
llm_load_vocab: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
llm_load_vocab: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
llm_load_vocab: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
llm_load_vocab: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
llm_load_vocab: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
llm_load_vocab: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
llm_load_vocab: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
llm_load_vocab: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
llm_load_vocab: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
llm_load_vocab: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
llm_load_vocab: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
llm_load_vocab: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
llm_load_vocab: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
llm_load_vocab: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
llm_load_vocab: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
llm_load_vocab: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
llm_load_vocab: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
llm_load_vocab: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
llm_load_vocab: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
llm_load_vocab: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
llm_load_vocab: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
llm_load_vocab: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
llm_load_vocab: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
llm_load_vocab: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
llm_load_vocab: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
llm_load_vocab: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
llm_load_vocab: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
llm_load_vocab: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
llm_load_vocab: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
llm_load_vocab: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
llm_load_vocab: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
llm_load_vocab: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
llm_load_vocab: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
llm_load_vocab: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
llm_load_vocab: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
llm_load_vocab: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
llm_load_vocab: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
llm_load_vocab: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
llm_load_vocab: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
llm_load_vocab: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
llm_load_vocab: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
llm_load_vocab: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
llm_load_vocab: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
llm_load_vocab: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
llm_load_vocab: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
llm_load_vocab: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
llm_load_vocab: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
llm_load_vocab: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
llm_load_vocab: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
llm_load_vocab: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
llm_load_vocab: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
llm_load_vocab: control token: 128010 '<|python_tag|>' is not marked as EOG
llm_load_vocab: control token: 128006 '<|start_header_id|>' is not marked as EOG
llm_load_vocab: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
llm_load_vocab: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
llm_load_vocab: control token: 128000 '<|begin_of_text|>' is not marked as EOG
llm_load_vocab: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
llm_load_vocab: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
llm_load_vocab: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
llm_load_vocab: control token: 128007 '<|end_header_id|>' is not marked as EOG
llm_load_vocab: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
llm_load_vocab: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
llm_load_vocab: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
llm_load_vocab: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
llm_load_vocab: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
llm_load_vocab: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
llm_load_vocab: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
[... remaining "llm_load_vocab: control token: ... is not marked as EOG" lines for the other <|reserved_special_token_*|> entries elided ...]
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Small
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 45.31 GiB (5.52 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
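Side note, to sanity-check the metadata block above: the reported size and bits-per-weight are self-consistent. This is just my own arithmetic, not anything from the log:

```python
# 45.31 GiB at 70.55 B params should come out to ~5.52 bits per weight
params = 70.55e9
size_bits = 45.31 * 2**30 * 8
print(size_bits / params)  # ~5.52, matches the reported BPW
```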
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 92 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 71 repeating layers to GPU
llm_load_tensors: offloaded 71/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 5871.55 MiB
llm_load_tensors: CUDA0 model buffer size = 19637.19 MiB
llm_load_tensors: CUDA1 model buffer size = 20198.25 MiB
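The four model buffers also add up to the model size, so the weights themselves appear to be placed where expected (again, my arithmetic only):

```python
# CPU + CUDA_Host + CUDA0 + CUDA1 model buffer sizes from the lines above, in MiB
buffers_mib = [688.88, 5871.55, 19637.19, 20198.25]
print(sum(buffers_mib) / 1024)  # ~45.31 GiB, matches llm_load_print_meta
```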
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
[... "model load progress" DEBUG lines elided (0.04 → 0.13) ...]
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
[... "model load progress" DEBUG lines elided (0.16 → 0.55) ...]
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
[... "model load progress" DEBUG lines elided (0.58 → 0.98) ...]
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 72.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 280.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
time=2025-01-03T08:50:09.159-05:00 level=DEBUG source=server.go:600 msg="model load progress 1.00"
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 324.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 105 (with bs=512), 4 (with bs=1)
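Note flash_attn = 0 in this run, so the KV cache ends up plain f16 (as far as I understand, a q8_0 KV cache requires flash attention and silently falls back otherwise). The KV buffer sizes check out if I compute them by hand, assuming the usual llama.cpp layout where K and V each hold n_layer × n_embd_k_gqa × n_ctx elements:

```python
# KV cache size check (my arithmetic, using the values printed above)
n_layer, n_embd_kv_gqa, n_ctx = 80, 1024, 2048
k_or_v_mib = n_layer * n_embd_kv_gqa * n_ctx * 2 / 2**20  # 2 bytes per f16 element
print(k_or_v_mib, 2 * k_or_v_mib)  # 320.0 MiB each for K and V, 640.0 MiB total
```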
time=2025-01-03T08:50:09.409-05:00 level=INFO source=server.go:594 msg="llama runner started in 11.02 seconds"
time=2025-01-03T08:50:09.409-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
[GIN] 2025/01/03 - 08:50:09 | 200 | 11.1587535s | 127.0.0.1 | POST "/api/generate"
time=2025-01-03T08:50:09.409-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2025-01-03T08:50:09.409-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba duration=1h0m0s
time=2025-01-03T08:50:09.409-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba refCount=0
time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba refCount=0
time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="154.9 GiB" before.free_swap="159.4 GiB" now.total="191.7 GiB" now.free="147.9 GiB" now.free_swap="110.7 GiB"
time=2025-01-03T08:50:51.817-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="21.8 GiB" now.total="24.0 GiB" now.free="974.7 MiB" now.used="23.0 GiB"
time=2025-01-03T08:50:51.833-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="1.8 GiB" now.used="21.1 GiB"
releasing nvml library
time=2025-01-03T08:50:51.834-05:00 level=DEBUG source=server.go:1080 msg="stopping llama server"
time=2025-01-03T08:50:51.834-05:00 level=DEBUG source=server.go:1086 msg="waiting for llama server to exit"
time=2025-01-03T08:50:52.085-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="147.9 GiB" before.free_swap="110.7 GiB" now.total="191.7 GiB" now.free="148.0 GiB" now.free_swap="148.7 GiB"
time=2025-01-03T08:50:52.222-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="974.7 MiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB"
time=2025-01-03T08:50:52.237-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="1.8 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB"
releasing nvml library
time=2025-01-03T08:50:52.239-05:00 level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.44 seconds" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=server.go:1090 msg="llama server stopped"
time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="148.0 GiB" before.free_swap="148.7 GiB" now.total="191.7 GiB" now.free="154.9 GiB" now.free_swap="159.4 GiB"
time=2025-01-03T08:50:52.505-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="21.8 GiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB"
time=2025-01-03T08:50:52.521-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB"
releasing nvml library
time=2025-01-03T08:50:52.541-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:50:52.541-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2025-01-03T08:50:52.541-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2025-01-03T08:50:52.542-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2025-01-03T08:50:52.542-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2025-01-03T08:50:52.543-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.8 GiB]"
time=2025-01-03T08:50:52.543-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.8 GiB]"
time=2025-01-03T08:50:52.543-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="154.9 GiB" before.free_swap="159.4 GiB" now.total="191.7 GiB" now.free="154.9 GiB" now.free_swap="159.4 GiB"
time=2025-01-03T08:50:52.568-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="21.8 GiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB"
time=2025-01-03T08:50:52.584-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB"
releasing nvml library
time=2025-01-03T08:50:52.585-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="154.9 GiB" free_swap="159.4 GiB"
time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[21.8 GiB 22.5 GiB]"
time=2025-01-03T08:50:52.585-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=61 layers.split=30,31 memory.available="[21.8 GiB 22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="56.1 GiB" memory.required.partial="43.7 GiB" memory.required.kv="5.0 GiB" memory.required.allocations="[21.5 GiB 22.1 GiB]" memory.weights.total="48.8 GiB" memory.weights.repeating="48.0 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="2.2 GiB" memory.graph.partial="2.2 GiB"
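The 30,31 tensor split looks like it is simply proportional to free VRAM on the two cards (a rough guess on my part; the real logic in ollama's memory.go is more involved):

```python
# Reproducing the split reported in the "offload to cuda" line above
free_gib = [21.8, 22.5]  # memory.available per GPU
layers = 61
print([round(layers * f / sum(free_gib)) for f in free_gib])  # [30, 31]
```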
time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe"
time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe"
time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11_avx\ollama_llama_server.exe"
time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe"
time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_avx\ollama_llama_server.exe"
time=2025-01-03T08:50:52.588-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba --ctx-size 16384 --batch-size 512 --n-gpu-layers 61 --verbose --threads 8 --no-mmap --parallel 1 --tensor-split 30,31 --port 51104"
time=2025-01-03T08:50:52.588-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_PATH_V12_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3 CUDA_PATH_V12_6=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama;C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\150\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Program Files (x86)\Incredibuild;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\nodejs\;C:\Program Files\NVIDIA Corporation\NVIDIA app\NvDLISR;C:\Program Files\Git\cmd;C:\Program Files\PuTTY\;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\Go\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2\;C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.42.34433\bin\Hostx64\x64;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Users\User\AppData\Local\Microsoft\WindowsApps;C:\Users\User\.dotnet\tools;C:\Users\User\miniconda3;C:\Users\User\miniconda3\Library\mingw-w64\bin;C:\Users\User\miniconda3\Library\usr\bin;C:\Users\User\miniconda3\Library\bin;C:\Users\User\miniconda3\Scripts;C:\Users\User\AppData\Roaming\npm;C:\Program Files\7-Zip;C:\ffmpeg\bin;;C:\Users\User\AppData\Local\Programs\Ollama;C:\Users\User\.cache\lm-studio\bin;C:\Users\User\go\bin;C:\Users\User\.dotnet\tools]"
time=2025-01-03T08:50:52.605-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-03T08:50:52.605-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-03T08:50:52.605-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-03T08:50:52.689-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2025-01-03T08:50:52.791-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2025-01-03T08:50:52.792-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:51104"
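Since the corruption only appears when both GPUs are active, I suspect the cross-device transfers. A minimal sketch to test that outside of Ollama (hypothetical; assumes a CUDA build of PyTorch is installed):

```python
import torch

# Round-trip a tensor between the two 4090s and check it survives the copy.
a = torch.randn(4096, 4096, device="cuda:0")
b = a.to("cuda:1")           # cross-device copy (P2P if available, else via host)
back = b.to("cuda:0")
print(torch.equal(a, back))  # should print True on healthy hardware/drivers
print(torch.cuda.can_device_access_peer(0, 1))  # is direct peer access reported?
```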
time=2025-01-03T08:50:52.856-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 16
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: quantize.imatrix.file str = /models_out/Llama-3.3-70B-Instruct-ab...
llama_model_loader: - kv 37: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 561 tensors
llama_model_loader: - type q6_K: 1 tensors
[... llm_load_vocab control-token warnings and llm_load_print_meta output for the second load elided; identical to the first load above ...]
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 192 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloaded 61/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 11482.17 MiB
llm_load_tensors: CUDA0 model buffer size = 16831.88 MiB
llm_load_tensors: CUDA1 model buffer size = 17392.94 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2025-01-03T08:50:55.861-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.01"
time=2025-01-03T08:50:56.111-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.05"
time=2025-01-03T08:50:56.362-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.08"
time=2025-01-03T08:50:56.612-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.11"
time=2025-01-03T08:50:56.862-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.14"
time=2025-01-03T08:50:57.113-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.17"
time=2025-01-03T08:50:57.363-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.20"
time=2025-01-03T08:50:57.614-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.23"
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
time=2025-01-03T08:50:57.864-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.26"
time=2025-01-03T08:50:58.114-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.29"
time=2025-01-03T08:50:58.365-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.32"
time=2025-01-03T08:50:58.616-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.34"
time=2025-01-03T08:50:58.866-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.37"
time=2025-01-03T08:50:59.116-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.40"
time=2025-01-03T08:50:59.367-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.42"
time=2025-01-03T08:50:59.617-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.45"
time=2025-01-03T08:50:59.867-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.47"
time=2025-01-03T08:51:00.118-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.50"
time=2025-01-03T08:51:00.368-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.53"
time=2025-01-03T08:51:00.618-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.57"
time=2025-01-03T08:51:00.869-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.59"
load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
time=2025-01-03T08:51:01.119-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.63"
time=2025-01-03T08:51:01.369-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.63"
time=2025-01-03T08:51:01.620-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.67"
time=2025-01-03T08:51:01.870-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.69"
time=2025-01-03T08:51:02.120-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.72"
time=2025-01-03T08:51:02.371-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.75"
time=2025-01-03T08:51:02.621-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.78"
time=2025-01-03T08:51:02.872-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.81"
time=2025-01-03T08:51:03.122-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.84"
time=2025-01-03T08:51:03.373-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.87"
time=2025-01-03T08:51:03.623-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.89"
time=2025-01-03T08:51:03.873-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.92"
time=2025-01-03T08:51:04.123-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.94"
time=2025-01-03T08:51:04.374-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.97"
time=2025-01-03T08:51:04.624-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.99"
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
time=2025-01-03T08:51:04.875-05:00 level=DEBUG source=server.go:600 msg="model load progress 1.00"
llama_kv_cache_init: CPU KV buffer size = 1216.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1920.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1984.00 MiB
llama_new_context_with_model: KV self size = 5120.00 MiB, K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
time=2025-01-03T08:51:05.126-05:00 level=DEBUG source=server.go:603 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: CUDA0 compute buffer size = 2224.00 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 2144.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 48.01 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 215 (with bs=512), 4 (with bs=1)
time=2025-01-03T08:51:05.376-05:00 level=INFO source=server.go:594 msg="llama runner started in 12.77 seconds"
time=2025-01-03T08:51:05.376-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:51:05.377-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nWrite the game of Tetris in Python<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2025-01-03T08:51:05.378-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=18 used=0 remaining=18
time=2025-01-03T08:51:24.053-05:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2025-01-03T08:51:24.053-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba duration=1h0m0s
time=2025-01-03T08:51:24.053-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba refCount=0
[GIN] 2025/01/03 - 08:51:24 | 200 | 32.2631045s | 127.0.0.1 | POST "/api/chat"
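For anyone trying to reproduce the run above: the logged request is a plain `/api/chat` call against the local server. A minimal sketch in Python (standard Ollama REST API; the model tag is the one reported in this issue and is an assumption about your setup):

```python
# Minimal sketch of the /api/chat request seen in the log above.
# Assumes Ollama on the default local port; the model tag is the one
# reported in this issue -- substitute whatever tag you have pulled.
import json
import urllib.request

payload = {
    "model": "hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M",
    "messages": [{"role": "user", "content": "Write the game of Tetris in Python"}],
    "stream": False,  # single JSON response instead of a token stream
}

req = urllib.request.Request(
    "http://127.0.0.1:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

# When the multi-GPU bug triggers, this prints the kind of garbled
# token soup quoted earlier instead of Python code.
print(reply["message"]["content"])
```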

<!-- gh-comment-id:2569252809 -->
@robbyjo commented on GitHub (Jan 3, 2025):

I turned off OLLAMA_FLASH_ATTENTION. Its output is garbled. Here is a sample of it:

> 1&'D!84E,C#++$$>.!1"$;2E$'>?6E=H<49>AGE>64?%

Here is the debug output of server.log:

2025/01/03 08:49:45 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\\DeepLearning\\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-01-03T08:49:45.460-05:00 level=INFO source=images.go:757 msg="total blobs: 79"
time=2025-01-03T08:49:45.462-05:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2025-01-03T08:49:45.463-05:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:80 msg="runners located" dir="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:45.463-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:45.464-05:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]"
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=routes.go:1340 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2025-01-03T08:49:45.464-05:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-03T08:49:45.464-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-01-03T08:49:45.464-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-01-03T08:49:45.464-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=gpu.go:99 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvml.dll
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvml.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvml.dll C:\\Program Files (x86)\\Incredibuild\\nvml.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvml.dll C:\\Program Files\\nodejs\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files\\PuTTY\\nvml.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvml.dll C:\\Program Files\\Go\\bin\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2024.3.2\\nvml.dll C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.42.34433\\bin\\Hostx64\\x64\\nvml.dll C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler\\nvml.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\User\\.dotnet\\tools\\nvml.dll C:\\Users\\User\\miniconda3\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\usr\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Library\\bin\\nvml.dll C:\\Users\\User\\miniconda3\\Scripts\\nvml.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvml.dll C:\\Program Files\\7-Zip\\nvml.dll C:\\ffmpeg\\bin\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\User\\.cache\\lm-studio\\bin\\nvml.dll C:\\Users\\User\\go\\bin\\nvml.dll C:\\Users\\User\\.dotnet\\tools\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-01-03T08:49:45.464-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2025-01-03T08:49:45.465-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2025-01-03T08:49:45.479-05:00 level=DEBUG source=gpu.go:120 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2025-01-03T08:49:45.479-05:00 level=DEBUG source=gpu.go:517 msg="Searching for GPU library" name=nvcuda.dll
time=2025-01-03T08:49:45.479-05:00 level=DEBUG source=gpu.go:543 msg="gpu library search" globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp\\nvcuda.dll C:\\Program Files\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\nvcuda.dll C:\\Program Files (x86)\\Incredibuild\\nvcuda.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvcuda.dll C:\\Program Files\\nodejs\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files\\PuTTY\\nvcuda.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvcuda.dll C:\\Program Files\\Go\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2024.3.2\\nvcuda.dll C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.42.34433\\bin\\Hostx64\\x64\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\User\\.dotnet\\tools\\nvcuda.dll C:\\Users\\User\\miniconda3\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\usr\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Library\\bin\\nvcuda.dll C:\\Users\\User\\miniconda3\\Scripts\\nvcuda.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvcuda.dll C:\\Program Files\\7-Zip\\nvcuda.dll C:\\ffmpeg\\bin\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\User\\.cache\\lm-studio\\bin\\nvcuda.dll C:\\Users\\User\\go\\bin\\nvcuda.dll C:\\Users\\User\\.dotnet\\tools\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2025-01-03T08:49:45.479-05:00 level=DEBUG source=gpu.go:548 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2025-01-03T08:49:45.480-05:00 level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
initializing C:\Windows\system32\nvcuda.dll
dlsym: cuInit - 00007FFD00D34D20
dlsym: cuDriverGetVersion - 00007FFD00D34DC0
dlsym: cuDeviceGetCount - 00007FFD00D355B6
dlsym: cuDeviceGet - 00007FFD00D355B0
dlsym: cuDeviceGetAttribute - 00007FFD00D34F10
dlsym: cuDeviceGetUuid - 00007FFD00D355C2
dlsym: cuDeviceGetName - 00007FFD00D355BC
dlsym: cuCtxCreate_v3 - 00007FFD00D35634
dlsym: cuMemGetInfo_v2 - 00007FFD00D35736
dlsym: cuCtxDestroy - 00007FFD00D35646
calling cuInit
calling cuDriverGetVersion
raw version 0x2f26
CUDA driver version: 12.7
calling cuDeviceGetCount
device count 2
time=2025-01-03T08:49:45.500-05:00 level=DEBUG source=gpu.go:134 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA totalMem 24563 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] CUDA freeMem 22994 mb
[GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23] Compute Capability 8.9
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA totalMem 24563 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] CUDA freeMem 22994 mb
[GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8] Compute Capability 8.9
time=2025-01-03T08:49:45.718-05:00 level=INFO source=gpu.go:334 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2025-01-03T08:49:45.719-05:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2025-01-03T08:49:45.720-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2025-01-03T08:49:45.720-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2025/01/03 - 08:49:58 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/03 - 08:49:58 | 200 | 22.0758ms | 127.0.0.1 | POST "/api/show"
time=2025-01-03T08:49:58.273-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="154.9 GiB" before.free_swap="159.5 GiB" now.total="191.7 GiB" now.free="154.9 GiB" now.free_swap="159.5 GiB"
time=2025-01-03T08:49:58.289-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB"
time=2025-01-03T08:49:58.305-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB"
releasing nvml library
time=2025-01-03T08:49:58.308-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7ff73eed4620 gpu_count=2
time=2025-01-03T08:49:58.343-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba
time=2025-01-03T08:49:58.343-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2025-01-03T08:49:58.344-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2025-01-03T08:49:58.344-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2025-01-03T08:49:58.345-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2025-01-03T08:49:58.346-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.8 GiB]"
time=2025-01-03T08:49:58.347-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.8 GiB]"
time=2025-01-03T08:49:58.348-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="154.9 GiB" before.free_swap="159.5 GiB" now.total="191.7 GiB" now.free="154.9 GiB" now.free_swap="159.4 GiB"
time=2025-01-03T08:49:58.367-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="21.8 GiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB"
time=2025-01-03T08:49:58.382-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB"
releasing nvml library
time=2025-01-03T08:49:58.384-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="154.9 GiB" free_swap="159.4 GiB"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[21.8 GiB 22.5 GiB]"
time=2025-01-03T08:49:58.384-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=71 layers.split=35,36 memory.available="[21.8 GiB 22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="49.4 GiB" memory.required.partial="43.6 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[21.5 GiB 22.1 GiB]" memory.weights.total="44.5 GiB" memory.weights.repeating="43.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.384-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_avx\\ollama_llama_server.exe"
time=2025-01-03T08:49:58.389-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe runner --model E:\\DeepLearning\\LLM\\blobs\\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba --ctx-size 2048 --batch-size 512 --n-gpu-layers 71 --verbose --threads 8 --no-mmap --parallel 1 --tensor-split 35,36 --port 50946"
time=2025-01-03T08:49:58.389-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_3=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3 CUDA_PATH_V12_6=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files (x86)\\Incredibuild;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Program Files\\Git\\cmd;C:\\Program Files\\PuTTY\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2024.3.2\\;C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.42.34433\\bin\\Hostx64\\x64;C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler;C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\User\\.dotnet\\tools;C:\\Users\\User\\miniconda3;C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin;C:\\Users\\User\\miniconda3\\Library\\usr\\bin;C:\\Users\\User\\miniconda3\\Library\\bin;C:\\Users\\User\\miniconda3\\Scripts;C:\\Users\\User\\AppData\\Roaming\\npm;C:\\Program Files\\7-Zip;C:\\ffmpeg\\bin;;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Users\\User\\.cache\\lm-studio\\bin;C:\\Users\\User\\go\\bin;C:\\Users\\User\\.dotnet\\tools]"
time=2025-01-03T08:49:58.392-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-03T08:49:58.392-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-03T08:49:58.393-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-03T08:49:58.466-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2025-01-03T08:49:58.559-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2025-01-03T08:49:58.560-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:50946"
time=2025-01-03T08:49:58.644-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 16
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: quantize.imatrix.file str = /models_out/Llama-3.3-70B-Instruct-ab...
llama_model_loader: - kv 37: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 561 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Small
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 45.31 GiB (5.52 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 92 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 71 repeating layers to GPU
llm_load_tensors: offloaded 71/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 5871.55 MiB
llm_load_tensors: CUDA0 model buffer size = 19637.19 MiB
llm_load_tensors: CUDA1 model buffer size = 20198.25 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
level=DEBUG source=server.go:600 msg="model load progress 0.04" time=2025-01-03T08:50:00.646-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.07" time=2025-01-03T08:50:00.897-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.10" time=2025-01-03T08:50:01.147-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.13" load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0 time=2025-01-03T08:50:01.397-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.16" time=2025-01-03T08:50:01.648-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.18" time=2025-01-03T08:50:01.899-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.21" time=2025-01-03T08:50:02.149-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.23" time=2025-01-03T08:50:02.399-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.26" time=2025-01-03T08:50:02.650-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.28" time=2025-01-03T08:50:02.900-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.31" time=2025-01-03T08:50:03.150-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.34" time=2025-01-03T08:50:03.401-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.36" time=2025-01-03T08:50:03.651-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.39" time=2025-01-03T08:50:03.901-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.41" time=2025-01-03T08:50:04.151-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.44" time=2025-01-03T08:50:04.401-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.46" time=2025-01-03T08:50:04.652-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.49" time=2025-01-03T08:50:04.902-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.52" time=2025-01-03T08:50:05.153-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.55" load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1 time=2025-01-03T08:50:05.404-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.58" time=2025-01-03T08:50:05.654-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.61" time=2025-01-03T08:50:05.904-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.64" time=2025-01-03T08:50:06.154-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.67" time=2025-01-03T08:50:06.404-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.70" time=2025-01-03T08:50:06.655-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.74" time=2025-01-03T08:50:06.905-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.76" time=2025-01-03T08:50:07.156-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.79" time=2025-01-03T08:50:07.406-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.82" time=2025-01-03T08:50:07.657-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.85" time=2025-01-03T08:50:07.907-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.88" time=2025-01-03T08:50:08.157-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.90" time=2025-01-03T08:50:08.407-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.93" time=2025-01-03T08:50:08.658-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.95" time=2025-01-03T08:50:08.908-05:00 level=DEBUG 
source=server.go:600 msg="model load progress 0.98" llama_new_context_with_model: n_seq_max = 1 llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_ctx_per_seq = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized llama_kv_cache_init: CPU KV buffer size = 72.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 280.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 288.00 MiB llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB llama_new_context_with_model: CPU output buffer size = 0.52 MiB time=2025-01-03T08:50:09.159-05:00 level=DEBUG source=server.go:600 msg="model load progress 1.00" llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB llama_new_context_with_model: CUDA1 compute buffer size = 324.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB llama_new_context_with_model: graph nodes = 2566 llama_new_context_with_model: graph splits = 105 (with bs=512), 4 (with bs=1) time=2025-01-03T08:50:09.409-05:00 level=INFO source=server.go:594 msg="llama runner started in 11.02 seconds" time=2025-01-03T08:50:09.409-05:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba [GIN] 2025/01/03 - 08:50:09 | 200 | 11.1587535s | 127.0.0.1 | POST "/api/generate" time=2025-01-03T08:50:09.409-05:00 level=DEBUG source=sched.go:466 msg="context for request finished" time=2025-01-03T08:50:09.409-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba duration=1h0m0s time=2025-01-03T08:50:09.409-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba refCount=0 time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba refCount=0 time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:296 msg="waiting for pending requests to complete and unload to occur" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:50:51.802-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="154.9 GiB" 
before.free_swap="159.4 GiB" now.total="191.7 GiB" now.free="147.9 GiB" now.free_swap="110.7 GiB" time=2025-01-03T08:50:51.817-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="21.8 GiB" now.total="24.0 GiB" now.free="974.7 MiB" now.used="23.0 GiB" time=2025-01-03T08:50:51.833-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="1.8 GiB" now.used="21.1 GiB" releasing nvml library time=2025-01-03T08:50:51.834-05:00 level=DEBUG source=server.go:1080 msg="stopping llama server" time=2025-01-03T08:50:51.834-05:00 level=DEBUG source=server.go:1086 msg="waiting for llama server to exit" time=2025-01-03T08:50:52.085-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="147.9 GiB" before.free_swap="110.7 GiB" now.total="191.7 GiB" now.free="148.0 GiB" now.free_swap="148.7 GiB" time=2025-01-03T08:50:52.222-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="974.7 MiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB" time=2025-01-03T08:50:52.237-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="1.8 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB" releasing nvml library time=2025-01-03T08:50:52.239-05:00 level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.44 seconds" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=server.go:1090 msg="llama server stopped" time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=sched.go:302 msg="unload completed" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:50:52.490-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="148.0 GiB" before.free_swap="148.7 GiB" now.total="191.7 GiB" now.free="154.9 GiB" now.free_swap="159.4 GiB" time=2025-01-03T08:50:52.505-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="21.8 GiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB" time=2025-01-03T08:50:52.521-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB" 
releasing nvml library time=2025-01-03T08:50:52.541-05:00 level=DEBUG source=sched.go:224 msg="loading first model" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:50:52.541-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2025-01-03T08:50:52.541-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]" time=2025-01-03T08:50:52.542-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]" time=2025-01-03T08:50:52.542-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]" time=2025-01-03T08:50:52.543-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.8 GiB]" time=2025-01-03T08:50:52.543-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.8 GiB]" time=2025-01-03T08:50:52.543-05:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="191.7 GiB" before.free="154.9 GiB" before.free_swap="159.4 GiB" now.total="191.7 GiB" now.free="154.9 GiB" now.free_swap="159.4 GiB" time=2025-01-03T08:50:52.568-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="21.8 GiB" now.total="24.0 GiB" now.free="21.8 GiB" now.used="2.1 GiB" time=2025-01-03T08:50:52.584-05:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="441.1 MiB" releasing nvml library time=2025-01-03T08:50:52.585-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="154.9 GiB" free_swap="159.4 GiB" time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[21.8 GiB 22.5 GiB]" time=2025-01-03T08:50:52.585-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=61 layers.split=30,31 memory.available="[21.8 GiB 22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="56.1 GiB" memory.required.partial="43.7 GiB" memory.required.kv="5.0 GiB" memory.required.allocations="[21.5 GiB 22.1 GiB]" memory.weights.total="48.8 GiB" memory.weights.repeating="48.0 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="2.2 GiB" memory.graph.partial="2.2 GiB" time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11_avx\\ollama_llama_server.exe" time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" 
file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe" time=2025-01-03T08:50:52.585-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_avx\\ollama_llama_server.exe" time=2025-01-03T08:50:52.586-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx\\ollama_llama_server.exe" time=2025-01-03T08:50:52.586-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe" time=2025-01-03T08:50:52.586-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v11_avx\\ollama_llama_server.exe" time=2025-01-03T08:50:52.586-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe" time=2025-01-03T08:50:52.586-05:00 level=DEBUG source=common.go:124 msg="availableServers : found" file="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\rocm_avx\\ollama_llama_server.exe" time=2025-01-03T08:50:52.588-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe runner --model E:\\DeepLearning\\LLM\\blobs\\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba --ctx-size 16384 --batch-size 512 --n-gpu-layers 61 --verbose --threads 8 --no-mmap --parallel 1 --tensor-split 30,31 --port 51104" time=2025-01-03T08:50:52.588-05:00 level=DEBUG source=server.go:393 msg=subprocess environment="[CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_PATH_V12_1=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1 CUDA_PATH_V12_3=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3 CUDA_PATH_V12_6=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_VISIBLE_DEVICES=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23,GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12_avx;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\libnvvp;C:\\Program Files\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn\\;C:\\Program Files (x86)\\Incredibuild;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Program 
Files\\nodejs\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Program Files\\Git\\cmd;C:\\Program Files\\PuTTY\\;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files\\Go\\bin;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2024.3.2\\;C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.42.34433\\bin\\Hostx64\\x64;C:\\Program Files (x86)\\Common Files\\Intel\\Shared Libraries\\redist\\intel64_win\\compiler;C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\User\\.dotnet\\tools;C:\\Users\\User\\miniconda3;C:\\Users\\User\\miniconda3\\Library\\mingw-w64\\bin;C:\\Users\\User\\miniconda3\\Library\\usr\\bin;C:\\Users\\User\\miniconda3\\Library\\bin;C:\\Users\\User\\miniconda3\\Scripts;C:\\Users\\User\\AppData\\Roaming\\npm;C:\\Program Files\\7-Zip;C:\\ffmpeg\\bin;;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Users\\User\\.cache\\lm-studio\\bin;C:\\Users\\User\\go\\bin;C:\\Users\\User\\.dotnet\\tools]" time=2025-01-03T08:50:52.605-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2025-01-03T08:50:52.605-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding" time=2025-01-03T08:50:52.605-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" time=2025-01-03T08:50:52.689-05:00 level=INFO source=runner.go:945 msg="starting go runner" ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes time=2025-01-03T08:50:52.791-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8 time=2025-01-03T08:50:52.792-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:51104" time=2025-01-03T08:50:52.856-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free llama_model_loader: loaded meta data with 40 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated llama_model_loader: - kv 4: general.basename str = Llama-3.3 llama_model_loader: - kv 5: general.size_label str = 70B llama_model_loader: - kv 6: general.license str = llama3.3 llama_model_loader: - kv 7: general.base_model.count u32 = 1 llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... 
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ... llama_model_loader: - kv 13: llama.block_count u32 = 80 llama_model_loader: - kv 14: llama.context_length u32 = 131072 llama_model_loader: - kv 15: llama.embedding_length u32 = 8192 llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 17: llama.attention.head_count u32 = 64 llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 21: llama.attention.key_length u32 = 128 llama_model_loader: - kv 22: llama.attention.value_length u32 = 128 llama_model_loader: - kv 23: general.file_type u32 = 16 llama_model_loader: - kv 24: llama.vocab_size u32 = 128256 llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004 llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... llama_model_loader: - kv 35: general.quantization_version u32 = 2 llama_model_loader: - kv 36: quantize.imatrix.file str = /models_out/Llama-3.3-70B-Instruct-ab... 
llama_model_loader: - kv 37: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 560 llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 125 llama_model_loader: - type f32: 162 tensors llama_model_loader: - type q5_K: 561 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG llm_load_vocab: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG llm_load_vocab: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG llm_load_vocab: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG llm_load_vocab: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG llm_load_vocab: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG llm_load_vocab: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG llm_load_vocab: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG llm_load_vocab: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG llm_load_vocab: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG llm_load_vocab: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG llm_load_vocab: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG llm_load_vocab: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG llm_load_vocab: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG llm_load_vocab: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG llm_load_vocab: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG llm_load_vocab: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG llm_load_vocab: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG llm_load_vocab: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG llm_load_vocab: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG llm_load_vocab: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG llm_load_vocab: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG llm_load_vocab: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG llm_load_vocab: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG llm_load_vocab: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG llm_load_vocab: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG llm_load_vocab: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG llm_load_vocab: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG llm_load_vocab: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG llm_load_vocab: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG llm_load_vocab: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG llm_load_vocab: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG llm_load_vocab: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG llm_load_vocab: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG llm_load_vocab: control token: 128194 
'<|reserved_special_token_186|>' is not marked as EOG llm_load_vocab: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG llm_load_vocab: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG llm_load_vocab: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG llm_load_vocab: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG llm_load_vocab: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG llm_load_vocab: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG llm_load_vocab: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG llm_load_vocab: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG llm_load_vocab: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG llm_load_vocab: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG llm_load_vocab: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG llm_load_vocab: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG llm_load_vocab: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG llm_load_vocab: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG llm_load_vocab: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG llm_load_vocab: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG llm_load_vocab: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG llm_load_vocab: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG llm_load_vocab: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG llm_load_vocab: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG llm_load_vocab: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG llm_load_vocab: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG llm_load_vocab: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG llm_load_vocab: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG llm_load_vocab: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG llm_load_vocab: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG llm_load_vocab: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG llm_load_vocab: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG llm_load_vocab: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG llm_load_vocab: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG llm_load_vocab: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG llm_load_vocab: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG llm_load_vocab: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG llm_load_vocab: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG llm_load_vocab: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG llm_load_vocab: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG llm_load_vocab: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG llm_load_vocab: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG 
llm_load_vocab: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG llm_load_vocab: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG llm_load_vocab: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG llm_load_vocab: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG llm_load_vocab: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG llm_load_vocab: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG llm_load_vocab: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG llm_load_vocab: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG llm_load_vocab: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG llm_load_vocab: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG llm_load_vocab: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG llm_load_vocab: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG llm_load_vocab: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG llm_load_vocab: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG llm_load_vocab: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG llm_load_vocab: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG llm_load_vocab: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG llm_load_vocab: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG llm_load_vocab: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG llm_load_vocab: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG llm_load_vocab: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG llm_load_vocab: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG llm_load_vocab: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG llm_load_vocab: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG llm_load_vocab: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG llm_load_vocab: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG llm_load_vocab: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG llm_load_vocab: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG llm_load_vocab: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG llm_load_vocab: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG llm_load_vocab: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG llm_load_vocab: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG llm_load_vocab: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG llm_load_vocab: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG llm_load_vocab: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG llm_load_vocab: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG llm_load_vocab: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG llm_load_vocab: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG llm_load_vocab: control token: 128043 '<|reserved_special_token_35|>' is not marked as 
EOG llm_load_vocab: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG llm_load_vocab: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG llm_load_vocab: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG llm_load_vocab: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG llm_load_vocab: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG llm_load_vocab: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG llm_load_vocab: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG llm_load_vocab: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG llm_load_vocab: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG llm_load_vocab: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG llm_load_vocab: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG llm_load_vocab: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG llm_load_vocab: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG llm_load_vocab: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG llm_load_vocab: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG llm_load_vocab: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG llm_load_vocab: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG llm_load_vocab: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG llm_load_vocab: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG llm_load_vocab: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG llm_load_vocab: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG llm_load_vocab: control token: 128010 '<|python_tag|>' is not marked as EOG llm_load_vocab: control token: 128006 '<|start_header_id|>' is not marked as EOG llm_load_vocab: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG llm_load_vocab: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG llm_load_vocab: control token: 128000 '<|begin_of_text|>' is not marked as EOG llm_load_vocab: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG llm_load_vocab: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG llm_load_vocab: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG llm_load_vocab: control token: 128007 '<|end_header_id|>' is not marked as EOG llm_load_vocab: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG llm_load_vocab: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG llm_load_vocab: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG llm_load_vocab: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG llm_load_vocab: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG llm_load_vocab: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG llm_load_vocab: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG llm_load_vocab: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG llm_load_vocab: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG llm_load_vocab: control token: 128153 
'<|reserved_special_token_145|>' is not marked as EOG llm_load_vocab: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG llm_load_vocab: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG llm_load_vocab: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG llm_load_vocab: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG llm_load_vocab: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG llm_load_vocab: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG llm_load_vocab: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG llm_load_vocab: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG llm_load_vocab: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG llm_load_vocab: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG llm_load_vocab: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG llm_load_vocab: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG llm_load_vocab: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG llm_load_vocab: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG llm_load_vocab: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG llm_load_vocab: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG llm_load_vocab: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG llm_load_vocab: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG llm_load_vocab: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG llm_load_vocab: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG llm_load_vocab: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG llm_load_vocab: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG llm_load_vocab: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG llm_load_vocab: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG llm_load_vocab: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG llm_load_vocab: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG llm_load_vocab: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG llm_load_vocab: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG llm_load_vocab: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG llm_load_vocab: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG llm_load_vocab: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG llm_load_vocab: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG llm_load_vocab: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG llm_load_vocab: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG llm_load_vocab: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG llm_load_vocab: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG llm_load_vocab: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG llm_load_vocab: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG llm_load_vocab: control 
token: 128186 '<|reserved_special_token_178|>' is not marked as EOG llm_load_vocab: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG llm_load_vocab: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG llm_load_vocab: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG llm_load_vocab: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG llm_load_vocab: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG llm_load_vocab: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG llm_load_vocab: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG llm_load_vocab: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG llm_load_vocab: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG llm_load_vocab: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG llm_load_vocab: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG llm_load_vocab: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG llm_load_vocab: control token: 128001 '<|end_of_text|>' is not marked as EOG llm_load_vocab: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG llm_load_vocab: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG llm_load_vocab: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG llm_load_vocab: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG llm_load_vocab: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG llm_load_vocab: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG llm_load_vocab: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG llm_load_vocab: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG llm_load_vocab: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG llm_load_vocab: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG llm_load_vocab: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG llm_load_vocab: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG llm_load_vocab: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG llm_load_vocab: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG llm_load_vocab: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG llm_load_vocab: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG llm_load_vocab: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG llm_load_vocab: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG llm_load_vocab: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG llm_load_vocab: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG llm_load_vocab: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG llm_load_vocab: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG llm_load_vocab: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG llm_load_vocab: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG llm_load_vocab: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG llm_load_vocab: 
control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG llm_load_vocab: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG llm_load_vocab: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG llm_load_vocab: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG llm_load_vocab: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG llm_load_vocab: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG llm_load_vocab: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG llm_load_vocab: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG llm_load_vocab: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG llm_load_vocab: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG llm_load_vocab: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG llm_load_vocab: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG llm_load_vocab: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG llm_load_vocab: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG llm_load_vocab: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG llm_load_vocab: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG llm_load_vocab: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG llm_load_vocab: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG llm_load_vocab: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG llm_load_vocab: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG llm_load_vocab: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG llm_load_vocab: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG llm_load_vocab: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG llm_load_vocab: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG llm_load_vocab: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 8192 llm_load_print_meta: n_layer = 80 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 28672 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: 
freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 70B llm_load_print_meta: model ftype = Q5_K - Small llm_load_print_meta: model params = 70.55 B llm_load_print_meta: model size = 45.31 GiB (5.52 BPW) llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOM token = 128008 '<|eom_id|>' llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOG token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 192 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead llm_load_tensors: offloading 61 repeating layers to GPU llm_load_tensors: offloaded 61/81 layers to GPU llm_load_tensors: CPU model buffer size = 688.88 MiB llm_load_tensors: CUDA_Host model buffer size = 11482.17 MiB llm_load_tensors: CUDA0 model buffer size = 16831.88 MiB llm_load_tensors: CUDA1 model buffer size = 17392.94 MiB load_all_data: no device found for buffer type CPU for async uploads load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads time=2025-01-03T08:50:55.861-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.01" time=2025-01-03T08:50:56.111-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.05" time=2025-01-03T08:50:56.362-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.08" time=2025-01-03T08:50:56.612-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.11" time=2025-01-03T08:50:56.862-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.14" time=2025-01-03T08:50:57.113-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.17" time=2025-01-03T08:50:57.363-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.20" time=2025-01-03T08:50:57.614-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.23" load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0 time=2025-01-03T08:50:57.864-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.26" time=2025-01-03T08:50:58.114-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.29" time=2025-01-03T08:50:58.365-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.32" time=2025-01-03T08:50:58.616-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.34" time=2025-01-03T08:50:58.866-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.37" time=2025-01-03T08:50:59.116-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.40" time=2025-01-03T08:50:59.367-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.42" time=2025-01-03T08:50:59.617-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.45" time=2025-01-03T08:50:59.867-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.47" time=2025-01-03T08:51:00.118-05:00 level=DEBUG 
source=server.go:600 msg="model load progress 0.50" time=2025-01-03T08:51:00.368-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.53" time=2025-01-03T08:51:00.618-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.57" time=2025-01-03T08:51:00.869-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.59" load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1 time=2025-01-03T08:51:01.119-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.63" time=2025-01-03T08:51:01.369-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.63" time=2025-01-03T08:51:01.620-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.67" time=2025-01-03T08:51:01.870-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.69" time=2025-01-03T08:51:02.120-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.72" time=2025-01-03T08:51:02.371-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.75" time=2025-01-03T08:51:02.621-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.78" time=2025-01-03T08:51:02.872-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.81" time=2025-01-03T08:51:03.122-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.84" time=2025-01-03T08:51:03.373-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.87" time=2025-01-03T08:51:03.623-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.89" time=2025-01-03T08:51:03.873-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.92" time=2025-01-03T08:51:04.123-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.94" time=2025-01-03T08:51:04.374-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.97" time=2025-01-03T08:51:04.624-05:00 level=DEBUG source=server.go:600 msg="model load progress 0.99" llama_new_context_with_model: n_seq_max = 1 llama_new_context_with_model: n_ctx = 16384 llama_new_context_with_model: n_ctx_per_seq = 16384 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized time=2025-01-03T08:51:04.875-05:00 level=DEBUG source=server.go:600 msg="model load progress 1.00" llama_kv_cache_init: CPU KV buffer size = 1216.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 1920.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 1984.00 MiB llama_new_context_with_model: KV self size = 5120.00 MiB, K (f16): 2560.00 MiB, V (f16): 2560.00 MiB llama_new_context_with_model: CPU output buffer size = 0.52 MiB time=2025-01-03T08:51:05.126-05:00 level=DEBUG source=server.go:603 msg="model load completed, waiting for server to become available" status="llm server loading model" llama_new_context_with_model: CUDA0 compute buffer size = 2224.00 MiB llama_new_context_with_model: CUDA1 compute buffer size = 2144.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 48.01 MiB llama_new_context_with_model: graph nodes = 2566 llama_new_context_with_model: graph splits = 215 (with bs=512), 4 (with bs=1) time=2025-01-03T08:51:05.376-05:00 level=INFO source=server.go:594 msg="llama runner started in 12.77 seconds" time=2025-01-03T08:51:05.376-05:00 level=DEBUG source=sched.go:462 msg="finished 
setting up runner" model=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba time=2025-01-03T08:51:05.377-05:00 level=DEBUG source=routes.go:1542 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nWrite the game of Tetris in Python<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" time=2025-01-03T08:51:05.378-05:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=18 used=0 remaining=18 time=2025-01-03T08:51:24.053-05:00 level=DEBUG source=sched.go:466 msg="context for request finished" time=2025-01-03T08:51:24.053-05:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba duration=1h0m0s time=2025-01-03T08:51:24.053-05:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=E:\DeepLearning\LLM\blobs\sha256-b2f816f1ede7e263e90aae9221bce26a6b68a6ccdec52304f20dc66682f573ba refCount=0 [GIN] 2025/01/03 - 08:51:24 | 200 | 32.2631045s | 127.0.0.1 | POST "/api/chat"

@robbyjo commented on GitHub (Jan 7, 2025):

Could it be that the update from v0.4.7 to 0.5.0 overwrote parts that were essential to multi-GPU support on Windows?


@rick-github commented on GitHub (Jan 8, 2025):

It's looking more like a Windows-specific issue. I configured a Linux server with 2x4090, CUDA 12.7, and it works fine with the `Write the game of Tetris in Python` prompt and the writing prompt from the first log. Something you could try is installing the [WSL CUDA driver](https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl) and running ollama inside a WSL container. That would remove any Windows differences from ollama, and it would be just down to the Nvidia driver and the cards.


@YonTracks commented on GitHub (Jan 9, 2025):

this is very good info, cheers:

> I turned off OLLAMA_FLASH_ATTENTION. Its output is garbled.

and `windows specific` and `OLLAMA_NUM_PARALLEL` and

> If I enable OLLAMA_SCHED_SPREAD, then the output will be garbled no matter what (regardless of how many layers I offload to GPU).

for this particular issue `num_ctx`, but the `embeddings` issue is also very related!

this will take me a while, a few things are happening here, but:
it all points back to the `Normalize the NumCtx for parallelism` step in `server/sched.go`, with `needsReload(ctx context.Context, req *LlmRequest)`, and Windows, and more.

using `go test` I will find it.
edit^: actually the issue is showing via the go test. `go test -tags=integration ./...` awesome, I should be able to sort it.

wish I could explain better, forgive me. I will show via the code itself.


@YonTracks commented on GitHub (Jan 9, 2025):

Howdy @rick-github,
should I just attempt to fix this, which is a few related issues? but then what? do I open a PR with the full fix,
or
a PR for each fix,
or
try to reveal each issue one at a time, either here or in the other related issues,
or file a new bug report,
or
should I just wait, as the mob is already on to it?

I fear this issue (or issues) is making ollama look bad?
lol, hope this makes sense, a little at least.


@rick-github commented on GitHub (Jan 9, 2025):

If you think you understand the problem, go ahead and make a PR with a full fix with test cases to demonstrate the problem and resolution. If it turns out to be too big, the reviewers will likely provide guidance on how to split it into manageable chunks.


@robbyjo commented on GitHub (Jan 10, 2025):

@rick-github I tried running ollama on WSL2. I already installed the NVIDIA drivers per the instructions, but I got "Error: timed out waiting for llama runner to start - progress 0.00 -"


@rick-github commented on GitHub (Jan 10, 2025):

Since ollama needs to communicate with the GPU via the WSL/host interface, it's a bit slower, so the model load timed out before it finished. You can extend the timeout by setting OLLAMA_LOAD_TIMEOUT=30m in the environment of the ollama server inside the WSL container.

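For anyone scripting this, a minimal sketch of starting the server with the longer timeout in its environment (my own illustration; it assumes Python and an `ollama` binary on the PATH inside WSL):

```
import os
import subprocess

# Start the ollama server with a longer model-load timeout in its
# environment; only this process (and its children) see the override.
env = dict(os.environ, OLLAMA_LOAD_TIMEOUT="30m")
server = subprocess.Popen(["ollama", "serve"], env=env)
server.wait()
```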

@robbyjo commented on GitHub (Jan 10, 2025):

@rick-github Thanks for the guidance. I ran ollama from WSL and I still got the garbled result. The query was the same: "Write the game of Tetris in Python"

Sample output:

> @7&,."F+AC<;C07(82;-,8(@!B+;34+&91DCE.CD"".15;0E>C.^C


@rick-github commented on GitHub (Jan 10, 2025):

OK, so that didn't eliminate any variables. What we know:

  1. ollama-0.5.0 + dual GPUs (4090 or A100) + linux == ok
  2. ollama-0.5.0 + dual 4090 on your machine (Windows) == bad
  3. ollama-0.5.0 + single 4090 on your machine (Windows) == ok
  4. ollama-0.5.0 + dual 4090 on your machine (WSL) == bad
  5. ollama-0.4.7 + dual 4090 on your machine (Windows) == ok
  6. FA makes no difference

I see a docker path in the logs; are you running ollama in docker on Windows, or on bare metal?

From the logs, it looks like you've tried the following models:

  • Llama 3.1 70B ArliAI RPMax v1.3 Q4_K - Medium
  • Llama 3.3 70B Instruct Abliterated Q5_K - Medium
  • Llama 3.3 70B Instruct Abliterated Q5_K - Small

These models seem to be finetunes; have you tried a stock model from the ollama library, e.g. [llama3.1:70b-instruct-q4_K_M](https://ollama.com/library/llama3.1:70b-instruct-q4_K_M)?

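To keep these comparisons repeatable, here is a minimal repro sketch (my own, not from the thread; it assumes the server is on the default 127.0.0.1:11434 and uses only the Python standard library) that sends the same prompt through the API and prints the head of the reply, where garbling is immediately visible:

```
import json
import urllib.request

# Same "Tetris" prompt through the local API, so runs on different
# configurations (single GPU, dual GPU, WSL) are directly comparable.
payload = {
    "model": "llama3.1:70b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Write the game of Tetris in Python"}],
    "options": {"num_ctx": 16384},
    "stream": False,
}
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["message"]["content"][:200])  # garbled output shows up in the first characters
```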

@robbyjo commented on GitHub (Jan 10, 2025):

@rick-github I tried the stock model you indicated ([llama3.1:70b-instruct-q4_K_M](https://ollama.com/library/llama3.1:70b-instruct-q4_K_M)) and it still didn't work. I tried it both on plain Windows and in WSL.

By the way, I mostly used 0.5.4, not 0.5.0.


@JohnSmithToYou commented on GitHub (Jan 11, 2025):

I have the same problem with my 2x4090 under WSL2. It looks like the new cache feature broke dual graphics cards. I don't get garbage, but my graphics cards are only half loaded. I just loaded Qwen2.5-Coder-32B-Instruct-Q6_K with a context of 98304. My cards should be filled! Instead they are only half full and it's offloading to the CPU.

@YonTracks In the log from above... Do these numbers look correct? Is minimum_memory correct? Isn't that the sum of both graphics cards? This is what my log shows also.

time=2024-12-22T14:37:43.770-05:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB"
...
time=2024-12-22T18:11:59.438-05:00 level=DEBUG source=memory.go:173 msg="gpu has too little memory to allocate any layers" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.1 GiB" minimum_memory=479199232 layer_size="2.5 GiB" gpu_zer_overhead="0 B" partial_offload="67.1 GiB" full_offload="65.1 GiB"
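As a sanity check on that figure (my own arithmetic, not from the thread): minimum_memory converts to well under one card's VRAM, so it reads as a fixed per-GPU reservation rather than a sum of both cards:

```
minimum_memory = 479199232        # bytes, from the log lines above
print(minimum_memory / 2**20)     # 457.0 -> exactly 457 MiB, far less than one 24 GiB card
```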

@rick-github commented on GitHub (Jan 11, 2025):

Full log, please.


@YonTracks commented on GitHub (Jan 11, 2025):

windows and goroutines, I'm seeing fun lol.
found a few things, trying fixes, but there's still more I think lol.
I am trying :)
https://github.com/ollama/ollama/pull/8029
edit^ I see a few updates already in the code that should help this, based on OLLAMA_FLASH_ATTENTION also.
ggml and cuda issues.

edit^
yep, last bit, yeeew :) runner.go
build and test. let's go. please lol.
cheers


@rick-github commented on GitHub (Jan 11, 2025):

Going back through the thread I realized I missed something.

> Ok. Here is the log for 0.4.7. At first, I only changed num_ctx to 131072, which worked great except for low memory utilization. I interrupted the output. Then I changed num_ctx to 32768 and num_gpu to 48 and repeated the same query. The result was then garbled.

So 0.4.7+ctx=128k+offload=4 works, 0.4.7+ctx=32k+offload=48 fails? 0.4.7 is not immune?

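For reference, that ctx/offload experiment can also be driven programmatically; a sketch of mine (assuming the default server address; `num_ctx` and `num_gpu` are the documented ollama request options for context size and offloaded layer count):

```
import json
import urllib.request

# Reproduce the 0.4.7 experiment: num_ctx sets the context window,
# num_gpu the number of layers offloaded to GPU.
payload = {
    "model": "llama3.1:70b-instruct-q4_K_M",
    "prompt": "Write the game of Tetris in Python",
    "options": {"num_ctx": 32768, "num_gpu": 48},
    "stream": False,
}
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"][:200])
```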

@YonTracks commented on GitHub (Jan 11, 2025):

my mind is mush lol, I'm having trouble getting my external gpu working for multi-gpu testing.
it was working, but pci issues and the riser, bugger.

here's the repo so far:
https://github.com/YonTracks/ollama-yontracks/tree/ollama-sched-server
maybe rick or some multi-GPU user could test.
cheers, good luck


@YonTracks commented on GitHub (Jan 12, 2025):

far out, motherboard issues. but maybe I did it; I could only simulate, but no more fails, and it's fast. when I confirm, I will PR.


@robbyjo commented on GitHub (Jan 12, 2025):

@YonTracks Just wondering if you could make a test build so that I could test it for you? Thanks!


@YonTracks commented on GitHub (Jan 12, 2025):

@robbyjo I'm not sure how; I will try to learn to do that, fast, but Wednesday I should be able to test properly anyway. cheers, I'll try soon.


@rick-github commented on GitHub (Jan 12, 2025):

In the meantime, is it correct that 0.4.7 is not immune?


@YonTracks commented on GitHub (Jan 12, 2025):

yes, I'm pretty sure 0.4.7 and earlier are not immune, but slight changes make it work sometimes; I can't confirm for multi-GPU.

pretty sure 3.2 vision, 0.4.0, is the start of these issues, and it got better and better with each update, but was never fully immune.


@YonTracks commented on GitHub (Jan 12, 2025):

yep, wow, far out, I managed to do it, epic! cheers.
https://github.com/YonTracks/ollama-yontracks/releases/tag/0.5.4-yontracks


@robbyjo commented on GitHub (Jan 13, 2025):

@YonTracks Thanks a lot for the custom build. Really appreciate it.

For "Write a game of Tetris in Python"
the output is still garbled, though: (>G>4#F"C+923)H<&6C!3:B"97

Also note that if I set CUDA_VISIBLE_DEVICES=1,0, it simply ignored GPU 0 and only loaded the model onto GPU 1.

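Worth double-checking that the remapping actually takes effect: CUDA_VISIBLE_DEVICES is read when the CUDA runtime initializes, so it must be set before the server (or any CUDA program) starts. A quick sketch, assuming PyTorch is installed:

```
import os

# Must be set before the first CUDA call, i.e. before importing torch here
# (or before launching the ollama server in the same environment).
os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"  # physical GPU 1 becomes cuda:0

import torch  # imported after setting the variable, on purpose

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```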

@YonTracks commented on GitHub (Jan 13, 2025):

@robbyjo cheers, can you experiment with default settings only (so no changes to visible devices, keep alive, parallel, etc.), and then experiment with OLLAMA_FLASH_ATTENTION and OLLAMA_SCHED_SPREAD, checking the logs for the details. cheers,
good luck.


@robbyjo commented on GitHub (Jan 13, 2025):

@YonTracks The failure I posted earlier was with both OLLAMA_FLASH_ATTENTION and OLLAMA_SCHED_SPREAD set to false.

Every possible combination of the two flags (true/true, true/false, false/true) still yielded garbled output.


@YonTracks commented on GitHub (Jan 13, 2025):

@robbyjo awesome, same same, cheers for your help, very appreciative.

edit:^ I forgot to mention, just in case (you might not know, or might forget): when making changes to the env, you need to restart ollama and the client.

edit:^ sorry about the many edits; also just in case, at the top of the server.log you can check the env details to be sure.

> @YonTracks The failure I posted earlier was with both OLLAMA_FLASH_ATTENTION and OLLAMA_SCHED_SPREAD set to false.
>
> Every possible combination of the two flags (true/true, true/false, false/true) still yielded garbled output.

can you, or did you, try num_ctx? if you don't want to try much, try near the max num_ctx (131072) and lower from there, or vice versa if there's no trouble. if any trouble, don't worry.
ollama will sort it soon anyway, most likely, if not already.

I will keep trying anyway when I can.
cheers


@robbyjo commented on GitHub (Jan 13, 2025):

@YonTracks Yes, I did restart ollama. No worries. I tried the experiment with num_ctx set to 16384.

BUT!!!! I got a new finding. When I load the model on GPU 1 only, it was all good. HOWEVER, if I load the model only on GPU 0, then the output is also garbled!!! Does that tell you something?


@YonTracks commented on GitHub (Jan 13, 2025):

> @YonTracks Yes, I did restart ollama. No worries. I tried the experiment with num_ctx set to 16384.
>
> BUT!!!! I got a new finding. When I load the model on GPU 1 only, it was all good. HOWEVER, if I load the model only on GPU 0, then the output is also garbled!!! Does that tell you something?

sure did super cheers, I now can't wait until Wednesday or Thursday to try multi-gpu's.
again, very much appreciated.
cheers.


@robbyjo commented on GitHub (Jan 13, 2025):

Thanks a lot @YonTracks and @rick-github . Really appreciate what you did. I would love to help you as much as I could. Would love to learn as well.


@rick-github commented on GitHub (Jan 13, 2025):

If you load the model only on GPU 0 and use standard 0.5.4, is the output garbled?


@YonTracks commented on GitHub (Jan 13, 2025):

There is one last thing I can try today: the current test build has the following commented out (it is supposed to be included, but it's a good test).

// Normalize the NumCtx for parallelism
// optsExisting.NumCtx = optsExisting.NumCtx / runner.numParallel

if you want to try it, I can uncomment and rebuild.
no trouble if not; I don't think that's fully it anyway, but it's a good test.

@YonTracks commented on GitHub (Jan 13, 2025):

> If you load the model only on GPU 0 and use standard 0.5.4, is the output garbled?

missed that, cheers, yep, good idea.

edit:^ If my way of thinking is correct, we should expect to see:

with both gpus: llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be...
and
with both gpus: llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be...
and none showing:
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be...

this is my main priority bug?


@YonTracks commented on GitHub (Jan 13, 2025):

> There is one last thing I can try today: the current test build has the following commented out (it is supposed to be included, but it's a good test).
>
> // Normalize the NumCtx for parallelism
> // optsExisting.NumCtx = optsExisting.NumCtx / runner.numParallel
>
> if you want to try it, I can uncomment and rebuild.
> no trouble if not; I don't think that's fully it anyway, but it's a good test.

https://github.com/YonTracks/ollama-yontracks/releases/tag/0.5.4-test2-yontracks


@robbyjo commented on GitHub (Jan 13, 2025):

@rick-github Yes, on the official ollama 0.5.4, loading only to GPU 0 leads to garbled output. I'll try the second test, @YonTracks; be right back.


@robbyjo commented on GitHub (Jan 13, 2025):

@YonTracks I tested your test2 build. For some reason, no matter what I request (GPU 0 only or GPU 1 only), the model is always loaded on GPU 1, and that worked. I notice that sometimes the official build somehow ignores this selection as well.

If I set CUDA_VISIBLE_DEVICES to both GPU0 and GPU1, then the output is still garbled.


@YonTracks commented on GitHub (Jan 13, 2025):

> @YonTracks I tested your test2 build. For some reason, no matter what I request (GPU 0 only or GPU 1 only), the model is always loaded on GPU 1, and that worked. I notice that sometimes the official build somehow ignores this selection as well.
>
> If I set CUDA_VISIBLE_DEVICES to both GPU0 and GPU1, then the output is still garbled.

in the server.log
you will see, when it's working, that the ctx shows a larger kv/ctx, for example: llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be...

and when it does not work: llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be...

and (2048) is the default, too small, and also the fallback for parallel. I bet if I force a default num_ctx of (32768), or the size needed, then it will work?
the server.log would be good to see.
and

time=2025-01-13T13:36:32.664+10:00 level=DEBUG source=sched.go:611 msg=optsExisting: ""="{NumCtx:2048 NumBatch:512 NumGPU:-1 MainGPU:0 LowVRAM:false F16KV:false LogitsAll:false VocabOnly:false UseMMap:<nil> UseMLock:false NumThread:0}"
time=2025-01-13T13:36:32.664+10:00 level=DEBUG source=sched.go:612 msg=ctx: ""="context.Background.WithDeadline(2025-01-13 13:36:32.763406 +1000 AEST m=+2.164479501 [99.3874ms]).WithCancel"

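To make the arithmetic behind those log lines concrete, a tiny sketch (my own illustration of the normalization step quoted above, not the actual scheduler code) of how the requested context is divided across parallel sequences, which is why a defaulted num_ctx surfaces as n_ctx_per_seq (2048):

```
def n_ctx_per_seq(num_ctx: int, num_parallel: int) -> int:
    # Mirrors "optsExisting.NumCtx / runner.numParallel": the requested
    # context window is split evenly across parallel sequences.
    return num_ctx // max(num_parallel, 1)

print(n_ctx_per_seq(32768, 1))  # 32768 -- the healthy case in the logs
print(n_ctx_per_seq(32768, 4))  # 8192  -- four parallel slots shrink each sequence
print(n_ctx_per_seq(2048, 1))   # 2048  -- the defaulted case seen when output garbles
```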

@rick-github commented on GitHub (Jan 13, 2025):

> Yes, on the official ollama 0.5.4, loading only to GPU 0 leads to garbled output. I'll try the second test

I'm leaning toward the idea that your GPU 0 is sub-optimal in some way. The switch from 0.3.14 to 0.5.* that saw the onset of this problem might be because the GPU kernels are executing in a different part of the GPU/VRAM, not because of any change in the code. It would explain the good/bad results for 0.4.7: the different context sizes move stuff around inside the GPU. Since nobody else has so far reported a problem, it narrows it down to your particular setup/configuration. Have you tried running a GPU/VRAM tester to see if anything gets flagged?

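A crude software-level version of such a test (a sketch of mine, assuming PyTorch with CUDA; not a substitute for a dedicated memory tester, but it catches gross faults) is to round-trip a known pattern through each card's VRAM:

```
import torch

# Write a known pattern to each visible GPU, read it back, and compare.
for i in range(torch.cuda.device_count()):
    dev = f"cuda:{i}"
    pattern = torch.arange(64 * 1024 * 1024, dtype=torch.int32)  # 256 MiB of data
    on_gpu = pattern.to(dev)
    back = on_gpu.cpu()
    print(dev, "ok" if torch.equal(pattern, back) else "MISMATCH")
```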

@robbyjo commented on GitHub (Jan 13, 2025):

For some reason, I tried the following in Python:

import torch

torch.cuda.device_count() # returns only 1 instead of 2

I checked GPU-Z and my GPU0 was not marked as supporting CUDA (which is weird).

Edit: I must add that nvidia-smi somehow recognizes both GPUs. Doubly weird. Why not CUDA?

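Extending that check into a small per-device smoke test can help separate "not visible to CUDA" from "visible but computing garbage"; a sketch of mine, assuming PyTorch with CUDA support:

```
import torch

# Enumerate visible CUDA devices and run a small matmul on each one;
# a faulty card tends to show up as a runtime error or wrong results.
print("devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    a = torch.randn(1024, 1024, device=f"cuda:{i}")
    b = torch.randn(1024, 1024, device=f"cuda:{i}")
    checksum = (a @ b).sum().item()
    print(i, name, checksum)
```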

@YonTracks commented on GitHub (Jan 13, 2025):

> For some reason, I tried the following in Python:
>
> import torch
>
> torch.cuda.device_count() # returns only 1 instead of 2
>
> I checked GPU-Z and my GPU0 was not marked as supporting CUDA (which is weird).
>
> Edit: I must add that nvidia-smi somehow recognizes both GPUs. Doubly weird. Why not CUDA?

@rick-github cheers for your help, and also cheers to @robbyjo. I'm learning heaps, and love it, super cheers.

rick you are awesome, cheers for putting up with things, you have a good way, nice!

does this mean this can be closed?


@rick-github commented on GitHub (Jan 13, 2025):

> does this mean this can be closed?

While I think this is a hardware problem, not a software problem, we should wait to see if Robby can do a test and verify. I don't use Windows, but I've heard that [OCCT](https://www.ocbase.com/download) is good for this sort of testing.


@robbyjo commented on GitHub (Jan 14, 2025):

@rick-github and @YonTracks Ok, this is weird. OCCT could detect the two cards:

[Screenshot: OCCT detecting both RTX 4090 cards]

I'm frankly at a loss. I tried downgrading to CUDA 12.1 or 12.3, but it still doesn't work (GPU0 is still not recognized as CUDA).


@rick-github commented on GitHub (Jan 14, 2025):

Did you run a VRAM test?


@YonTracks commented on GitHub (Jan 14, 2025):

Device Manager is good also, to confirm there are no gpu issues, that the drivers are compatible, and all that; I think in the properties of the display adapters the gpu will show if there is any issue, resources etc. Good luck.

funny enough, for me pci issues etc. might be a windows 11 thing, but I have an old riser and an old rtx2060, so lol, my own issues. but, interestingly, I believe it was working. I wonder.
good luck.


@robbyjo commented on GitHub (Jan 14, 2025):

Hi @rick-github and @YonTracks Thank you so much for your help. OCCT did not detect any errors. However, Device Manager did show some warnings on a PCI device (unsure if it's related to the display). As I mentioned earlier, GPU-Z detected both cards, but GPU0 did not have the CUDA flag checked, while GPU1 did. Which is weird.

[Six screenshots: Device Manager showing the PCI device warnings, and GPU-Z readouts for both cards]


@robbyjo commented on GitHub (Jan 14, 2025):

By the way, I have my own OpenCL program and it worked across both GPUs.

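Since the OpenCL path sees both cards, it may be worth enumerating what each API reports side by side; a small sketch of mine, assuming the pyopencl package is installed:

```
import pyopencl as cl

# List every OpenCL device the driver exposes, to compare against what
# CUDA (torch.cuda.device_count()) and nvidia-smi each report.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(platform.name, "|", device.name, "|",
              device.global_mem_size // 2**20, "MiB")
```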

@YonTracks commented on GitHub (Jan 14, 2025):

I hope I don't hinder.
but, yep, it seems like a pci bus and resources issue (an example would be M.2 ssd slots taking up pci resources too); on windows 11 I'm guessing cuda is not happy with gen 1.1 and happy with gen 4.0. Try to fix / delete all the yellow triangles (they should just auto-install on the next restart) and use trial and error for the devices needed etc., then try to force x8 for both gpus and check the M.2 slots; if you can, keep the first slot open, or slot 2, and check the bios.

I will be testing similar in the next few days hopefully, will know this afternoon anyway (PC parts).
good luck.

update^: yep, I had similar issues; I had to remove an M.2 ssd and make bios changes, or my PC would not even start with the 2nd gpu.
I need a better motherboard. I did not end up with any parts; I tried, I need to order online, oh well.
good luck


@rick-github commented on GitHub (Jan 15, 2025):

What happens if you switch the slots the cards are plugged into? Does GPU 1 then become the one that garbles output and has no CUDA in GPU-Z?


@robbyjo commented on GitHub (Jan 16, 2025):

Hi @YonTracks and @rick-github. At present all evidence seems to point to a hardware issue. I already updated the drivers, to no avail. PCIe still has problems. Not sure if I have to swap the GPUs; pretty sure GPU 1 (which would then become GPU 0) would be garbled. I'm in no position to remove the second SSD or replace the motherboard at this point. Will try this some time in the future. Thanks for all your help!


@robbyjo commented on GitHub (Mar 3, 2025):

Hi @YonTracks and @rick-github. I managed to get both of my GPUs to work now (both CUDA options are on). I updated Ollama to the latest version (0.5.12). My two GPUs held some load, but the output is still garbled.

ollama run hf.co/bartowski/DeepSeek-R1-Distill-Qwen-32B-abliterated-GGUF:Q6_K_L
>>> /set parameter num_ctx 16384
Set parameter 'num_ctx' to '16384'
>>> Write a game of tetris in Python
ize@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


@rick-github commented on GitHub (Mar 3, 2025):

What version of Windows do you use? If you are OK with running a non-standard build, I can have a go at generating a binary with some changes that may or may not make a difference.


@YonTracks commented on GitHub (Mar 3, 2025):

Howdy @robbyjo, sorry this is happening.
I feel for you. I don't want to hinder; Rick is awesome and can help and explain way better than I can, but he needs good info.

Does it work correctly at all?
is this docker? (not sure if it matters) lm-studio, and/or cli?
is everything for ollama default? what env params have you tried? OLLAMA_SCHED_SPREAD, OLLAMA_FLASH_ATTENTION, OLLAMA_GPU_OVERHEAD, CUDA_VISIBLE_DEVICES, etc.
which windows 11 build? how updated?
are the gpus using risers or network etc.?
which cuda 12.x, and are the path env variables confirmed?

the latest server.log from 0.5.12+ would be better also.


@YonTracks commented on GitHub (Mar 4, 2025):

ok, so recap:
correct me if I'm wrong, the main issue is:

  1. when using a single gpu it works as expected (cuda v11 or 12.1 and 12.3) if num_ctx is above 2048
  2. when using multiple gpus the output becomes garbled. seems like the context issue.

I would ensure/confirm the following
system variables:

  • env Path: cuda 12.1 and 12.3. the logs above are showing both?
    also check,
  • env CUDA_PATH_V12_3, CUDA_PATH_V12_1, or just CUDA_PATH; most likely it has both also? I think this causes issues! It did for me: 12.1 and 12.4 worked for me with older ollama, but newer ollama seems to need only one, or a proper config. I think 12.8 is best, not sure.
    I am currently using 12.8 with the toolkit only (I uninstalled everything else and started with 12.8, so the env gets set automatically with only one),
    but I can't yet test multi gpus, srry, I got sidetracked lol.

@YonTracks commented on GitHub (Mar 4, 2025):

and remove all ollama envs to use the defaults, then add them back as needed, confirming things as you go.

@robbyjo below are the variables from above. possible issues: bugger! this was in the original message; I missed it, so sorry: OLLAMA_GPU_OVERHEAD:1572864000. remember to start fresh; you should see OLLAMA_GPU_OVERHEAD:0 <<< that should be most of the issue.
check.

CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1572864000 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:E:\DeepLearning\LLM OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"

ollama should have a somewhat working default.
good luck.

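A quick way to verify that clean slate (a sketch of mine; run it in the same environment that launches the server):

```
import os

# Print every Ollama/CUDA-related variable so stale settings like
# OLLAMA_GPU_OVERHEAD=1572864000 can't sneak back in unnoticed.
for key, value in sorted(os.environ.items()):
    if key.startswith(("OLLAMA_", "CUDA_")):
        print(f"{key}={value}")
```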

@robbyjo commented on GitHub (Dec 10, 2025):

The issue was that my second GPU was faulty, and I RMA-ed it. Apologies for all the trouble.

Reference: github-starred/ollama#30990