[GH-ISSUE #5654] Failure to Generate Response After Model Unloading #65564

Closed
opened 2026-05-03 21:42:15 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @NWBx01 on GitHub (Jul 12, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5654

What is the issue?

From what I can tell, generating a response right after starting Ollama works flawlessly: I can switch between models and generate responses from prompts without issue. After a model unloads due to inactivity, however, I am unable to generate any further responses.
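For context, the failure is easy to trigger from the command line. A minimal repro sketch (assuming the default port 11434 and the llama3 model tag; the exact sleep just needs to exceed the 5-minute keep-alive):

```shell
# First request, shortly after startup: completes normally.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Hello",
  "stream": false
}'

# Wait past the idle keep-alive window (5m default), so the model unloads.
sleep 330

# Second request, after the unload: this is the one that produces no response.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Hello again",
  "stream": false
}'
```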

I use Nvidia vGPU 17.1 to pass my GPU through to a virtual machine running the GPU-enabled ollama Docker image. The CUDA compute capability is the same between host and guest: 6.1 on the host Quadro P4000, and 6.1 on the guest GRID P40-8Q. Both also have the same amount of VRAM: 8GB on the host, and 8GB on the guest. I don't believe this would cause any issues, but I thought it would be wise to mention it.
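For reference, the unload interval matches the OLLAMA_KEEP_ALIVE:5m0s value visible in the server config log below. As a workaround while debugging, the idle unload can be disabled entirely by setting the keep-alive to -1 when starting the container. A config sketch, assuming the standard ollama/ollama image and the NVIDIA Container Toolkit:

```shell
# Config sketch: OLLAMA_KEEP_ALIVE=-1 keeps models loaded indefinitely,
# so the unload path (where the failure occurs) is never taken.
docker run -d --gpus all \
  -e OLLAMA_KEEP_ALIVE=-1 \
  -e OLLAMA_DEBUG=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```

The same can be done per request by passing `"keep_alive": -1` in the body of an `/api/generate` or `/api/chat` call.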

Below are logs from when this happens (I've had to split this into two messages because of length):

2024/07/12 20:35:32 routes.go:1033: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-12T20:35:32.959Z level=INFO source=images.go:751 msg="total blobs: 25"
time=2024-07-12T20:35:32.960Z level=INFO source=images.go:758 msg="total unused blobs removed: 0"
time=2024-07-12T20:35:32.961Z level=INFO source=routes.go:1080 msg="Listening on [::]:11434 (version 0.2.1)"
time=2024-07-12T20:35:32.961Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama4058211551/runners
time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60101 file=build/linux/x86_64/rocm_v60101/bin/deps.txt.gz
time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60101 file=build/linux/x86_64/rocm_v60101/bin/ollama_llama_server.gz
time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T20:35:37.143Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60101 cpu cpu_avx]"
time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-07-12T20:35:37.143Z level=DEBUG source=sched.go:102 msg="starting llm scheduler"
time=2024-07-12T20:35:37.143Z level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-12T20:35:37.144Z level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-07-12T20:35:37.144Z level=DEBUG source=gpu.go:438 msg="Searching for GPU library" name=libcuda.so*
time=2024-07-12T20:35:37.144Z level=DEBUG source=gpu.go:457 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-07-12T20:35:37.150Z level=DEBUG source=gpu.go:491 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.550.54.15]
CUDA driver version: 12.4
time=2024-07-12T20:35:37.202Z level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.54.15
[GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda] CUDA totalMem 8192 mb
[GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda] CUDA freeMem 7541 mb
[GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda] Compute Capability 6.1
time=2024-07-12T20:35:37.364Z level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2024-07-12T20:35:37.364Z level=INFO source=types.go:103 msg="inference compute" id=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda library=cuda compute=6.1 driver=12.4 name="GRID P40-8Q" total="8.0 GiB" available="7.4 GiB"
[GIN] 2024/07/12 - 20:36:17 | 200 |    1.769258ms |  192.168.75.195 | GET      "/api/tags"
[GIN] 2024/07/12 - 20:36:17 | 200 |       94.14µs |  192.168.75.195 | GET      "/api/version"
time=2024-07-12T20:38:26.695Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.5 GiB" now.total="23.5 GiB" now.free="20.5 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:38:26.899Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T20:38:26.899Z level=DEBUG source=sched.go:182 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2024-07-12T20:38:26.923Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T20:38:26.924Z level=DEBUG source=sched.go:251 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:38:26.924Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T20:38:26.924Z level=INFO source=sched.go:738 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda parallel=4 available=7908343808 required="6.2 GiB"
time=2024-07-12T20:38:26.924Z level=DEBUG source=server.go:98 msg="system memory" total="23.5 GiB" free=21974278144
time=2024-07-12T20:38:26.924Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T20:38:26.925Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[7.4 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T20:38:26.926Z level=INFO source=server.go:375 msg="starting llama server" cmd="/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 44335"
time=2024-07-12T20:38:26.926Z level=DEBUG source=server.go:390 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4058211551/runners/cuda_v11:/tmp/ollama4058211551/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda]"
time=2024-07-12T20:38:26.927Z level=INFO source=sched.go:474 msg="loaded runners" count=1
time=2024-07-12T20:38:26.927Z level=INFO source=server.go:563 msg="waiting for llama runner to start responding"
time=2024-07-12T20:38:26.927Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="a8db2a9" tid="139851958501376" timestamp=1720816706
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="139851958501376" timestamp=1720816706 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="44335" tid="139851958501376" timestamp=1720816706
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-07-12T20:38:27.179Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: GRID P40-8Q, compute capability 6.1, VMM: no
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      CUDA0 buffer size =  4155.99 MiB
time=2024-07-12T20:38:28.184Z level=DEBUG source=server.go:615 msg="model load progress 0.18"
time=2024-07-12T20:38:28.436Z level=DEBUG source=server.go:615 msg="model load progress 0.64"
time=2024-07-12T20:38:28.686Z level=DEBUG source=server.go:615 msg="model load progress 0.99"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
time=2024-07-12T20:38:28.938Z level=DEBUG source=server.go:615 msg="model load progress 1.00"
time=2024-07-12T20:38:29.189Z level=DEBUG source=server.go:618 msg="model load completed, waiting for server to become available" status="llm server loading model"
DEBUG [initialize] initializing slots | n_slots=4 tid="139851958501376" timestamp=1720816709
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="139851958501376" timestamp=1720816709
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="139851958501376" timestamp=1720816709
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="139851958501376" timestamp=1720816709
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="139851958501376" timestamp=1720816709
INFO [main] model loaded | tid="139851958501376" timestamp=1720816709
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="139851958501376" timestamp=1720816709
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="139851958501376" timestamp=1720816709
time=2024-07-12T20:38:29.440Z level=INFO source=server.go:609 msg="llama runner started in 2.51 seconds"
time=2024-07-12T20:38:29.440Z level=DEBUG source=sched.go:487 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="139851958501376" timestamp=1720816709
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=39120 status=200 tid="139851487993856" timestamp=1720816709
time=2024-07-12T20:38:29.485Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=19 window=2048
time=2024-07-12T20:38:29.485Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="139851958501376" timestamp=1720816709
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="139851958501376" timestamp=1720816709
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=18 slot_id=0 task_id=3 tid="139851958501376" timestamp=1720816709
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="139851958501376" timestamp=1720816709
DEBUG [print_timings] prompt eval time     =     123.72 ms /    18 tokens (    6.87 ms per token,   145.49 tokens per second) | n_prompt_tokens_processed=18 n_tokens_second=145.48511202353626 slot_id=0 t_prompt_processing=123.724 t_token=6.873555555555556 task_id=3 tid="139851958501376" timestamp=1720816731
DEBUG [print_timings] generation eval time =   21845.44 ms /   482 runs   (   45.32 ms per token,    22.06 tokens per second) | n_decoded=482 n_tokens_second=22.064097209468482 slot_id=0 t_token=45.322497925311204 t_token_generation=21845.444 task_id=3 tid="139851958501376" timestamp=1720816731
DEBUG [print_timings]           total time =   21969.17 ms | slot_id=0 t_prompt_processing=123.724 t_token_generation=21845.444 t_total=21969.167999999998 task_id=3 tid="139851958501376" timestamp=1720816731
DEBUG [update_slots] slot released | n_cache_tokens=500 n_ctx=8192 n_past=499 n_system_tokens=0 slot_id=0 task_id=3 tid="139851958501376" timestamp=1720816731 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=39120 status=200 tid="139851487993856" timestamp=1720816731
[GIN] 2024/07/12 - 20:38:51 | 200 | 24.824350507s |  192.168.75.195 | POST     "/api/chat"
time=2024-07-12T20:38:51.499Z level=DEBUG source=sched.go:491 msg="context for request finished"
time=2024-07-12T20:38:51.499Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-07-12T20:38:51.499Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-07-12T20:38:51.606Z level=DEBUG source=sched.go:600 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=488 tid="139851958501376" timestamp=1720816731
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=489 tid="139851958501376" timestamp=1720816731
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=50248 status=200 tid="139851479601152" timestamp=1720816731
time=2024-07-12T20:38:51.653Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=92 window=2048
time=2024-07-12T20:38:51.653Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nHere is the query:\nWhat's the deal with orange juice?\n\nCreate a concise, 3-5 word phrase as a title for the previous query. Avoid quotation marks or special formatting. RESPOND ONLY WITH THE TITLE TEXT.\n\nExamples of titles:\nStock Market Trends\nPerfect Chocolate Chip Recipe\nEvolution of Music Streaming\nRemote Work Productivity Tips\nArtificial Intelligence in Healthcare\nVideo Game Development Insights<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=490 tid="139851958501376" timestamp=1720816731
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",42]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=491 tid="139851958501376" timestamp=1720816731
DEBUG [update_slots] slot progression | ga_i=0 n_past=5 n_past_se=0 n_prompt_tokens_processed=91 slot_id=0 task_id=491 tid="139851958501376" timestamp=1720816731
DEBUG [update_slots] kv cache rm [p0, end) | p0=5 slot_id=0 task_id=491 tid="139851958501376" timestamp=1720816731
DEBUG [print_timings] prompt eval time     =     300.25 ms /    91 tokens (    3.30 ms per token,   303.08 tokens per second) | n_prompt_tokens_processed=91 n_tokens_second=303.0797566036416 slot_id=0 t_prompt_processing=300.251 t_token=3.299461538461538 task_id=491 tid="139851958501376" timestamp=1720816732
DEBUG [print_timings] generation eval time =     175.14 ms /     5 runs   (   35.03 ms per token,    28.55 tokens per second) | n_decoded=5 n_tokens_second=28.548589699668838 slot_id=0 t_token=35.028 t_token_generation=175.14 task_id=491 tid="139851958501376" timestamp=1720816732
DEBUG [print_timings]           total time =     475.39 ms | slot_id=0 t_prompt_processing=300.251 t_token_generation=175.14 t_total=475.39099999999996 task_id=491 tid="139851958501376" timestamp=1720816732
DEBUG [update_slots] slot released | n_cache_tokens=96 n_ctx=8192 n_past=95 n_system_tokens=0 slot_id=0 task_id=491 tid="139851958501376" timestamp=1720816732 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=50248 status=200 tid="139851479601152" timestamp=1720816732
[GIN] 2024/07/12 - 20:38:52 | 200 |  590.647083ms |  192.168.75.195 | POST     "/v1/chat/completions"
time=2024-07-12T20:38:52.178Z level=DEBUG source=sched.go:432 msg="context for request finished"
time=2024-07-12T20:38:52.179Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-07-12T20:38:52.179Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-07-12T20:39:23.060Z level=DEBUG source=sched.go:600 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=499 tid="139851958501376" timestamp=1720816763
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=500 tid="139851958501376" timestamp=1720816763
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=55306 status=200 tid="139851471208448" timestamp=1720816763
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=501 tid="139851958501376" timestamp=1720816763
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=55306 status=200 tid="139851471208448" timestamp=1720816763
time=2024-07-12T20:39:23.194Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=539 window=2048
time=2024-07-12T20:39:23.194Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOrange juice - a staple in many breakfast routines, but also a beverage that sparks controversy and debate. Here are some interesting facts and perspectives on OJ:\n\n**Quality concerns:**\n\n1. **Processing:** Most commercial orange juices are made from concentrate, which involves freezing or drying the juice before reconstituting it with water. This can affect the flavor, nutrients, and overall quality of the final product.\n2. **Additives:** Some OJs contain added sugars, preservatives, and flavor enhancers, which may not align with consumer expectations.\n\n**Nutritional aspects:**\n\n1. **Vitamin C:** Orange juice is an excellent source of vitamin C, a vital antioxidant that boosts immune function and overall health.\n2. **Flavonoids:** OJ contains flavonoids, a type of polyphenol that may help protect against chronic diseases like heart disease, cancer, and cognitive decline.\n3. **Sugar content:** Orange juice can be high in natural sugars (fructose and glucose), which can be a concern for those monitoring their sugar intake.\n\n**Environmental impact:**\n\n1. **Sustainability:** Large-scale orange farming can have negative environmental impacts, such as water pollution, soil degradation, and habitat destruction.\n2. **Fair trade:** The orange juice industry is often criticized for exploiting small farmers and workers in countries like Brazil and Florida.\n\n**Cultural significance:**\n\n1. **Breakfast staple:** Orange juice has become a standard breakfast beverage in many Western cultures, particularly in the United States.\n2. **Florida's identity:** Orange juice is closely tied to Florida's agricultural heritage and economy, with the state being one of the world's largest producers.\n\n**Controversies and debates:**\n\n1. **Fake OJ:** The term \"fake orange juice\" refers to OJs that are not 100% freshly squeezed or contain added ingredients.\n2. **Squeeze vs. concentrate:** There is ongoing debate about whether fresh-squeezed OJ or concentrated juice with water reconstitution is better.\n\nIn conclusion, orange juice is a complex beverage with both positive and negative aspects. While it provides essential nutrients like vitamin C, its processing methods, sugar content, and environmental impact are areas of concern. As consumers, we can make informed choices by opting for high-quality, sustainably sourced OJs or exploring alternative beverages that align with our values and dietary needs.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOh, yeah. Speaking of Florida. I've heard that there was a train that carried oranges or orange juice. Do you know about that?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=502 tid="139851958501376" timestamp=1720816763
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",42]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=503 tid="139851958501376" timestamp=1720816763
DEBUG [update_slots] slot progression | ga_i=0 n_past=5 n_past_se=0 n_prompt_tokens_processed=538 slot_id=0 task_id=503 tid="139851958501376" timestamp=1720816763
DEBUG [update_slots] kv cache rm [p0, end) | p0=5 slot_id=0 task_id=503 tid="139851958501376" timestamp=1720816763
DEBUG [print_timings] prompt eval time     =    1492.37 ms /   538 tokens (    2.77 ms per token,   360.50 tokens per second) | n_prompt_tokens_processed=538 n_tokens_second=360.5006536587131 slot_id=0 t_prompt_processing=1492.369 t_token=2.773920074349442 task_id=503 tid="139851958501376" timestamp=1720816781
DEBUG [print_timings] generation eval time =   16369.56 ms /   329 runs   (   49.76 ms per token,    20.10 tokens per second) | n_decoded=329 n_tokens_second=20.09827875041976 slot_id=0 t_token=49.75550455927051 t_token_generation=16369.561 task_id=503 tid="139851958501376" timestamp=1720816781
DEBUG [print_timings]           total time =   17861.93 ms | slot_id=0 t_prompt_processing=1492.369 t_token_generation=16369.561 t_total=17861.93 task_id=503 tid="139851958501376" timestamp=1720816781
DEBUG [update_slots] slot released | n_cache_tokens=867 n_ctx=8192 n_past=866 n_system_tokens=0 slot_id=0 task_id=503 tid="139851958501376" timestamp=1720816781 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=55310 status=200 tid="139851462815744" timestamp=1720816781
[GIN] 2024/07/12 - 20:39:41 | 200 | 18.015866775s |  192.168.75.195 | POST     "/api/chat"
time=2024-07-12T20:39:41.058Z level=DEBUG source=sched.go:432 msg="context for request finished"
time=2024-07-12T20:39:41.058Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-07-12T20:39:41.058Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-07-12T20:40:53.034Z level=DEBUG source=sched.go:600 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=835 tid="139851958501376" timestamp=1720816853
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=836 tid="139851958501376" timestamp=1720816853
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=46440 status=200 tid="139851382910976" timestamp=1720816853
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=837 tid="139851958501376" timestamp=1720816853
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=46440 status=200 tid="139851382910976" timestamp=1720816853
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=838 tid="139851958501376" timestamp=1720816853
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=46444 status=200 tid="139851506835456" timestamp=1720816853
time=2024-07-12T20:40:53.209Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=899 window=2048
time=2024-07-12T20:40:53.209Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOrange juice - a staple in many breakfast routines, but also a beverage that sparks controversy and debate. Here are some interesting facts and perspectives on OJ:\n\n**Quality concerns:**\n\n1. **Processing:** Most commercial orange juices are made from concentrate, which involves freezing or drying the juice before reconstituting it with water. This can affect the flavor, nutrients, and overall quality of the final product.\n2. **Additives:** Some OJs contain added sugars, preservatives, and flavor enhancers, which may not align with consumer expectations.\n\n**Nutritional aspects:**\n\n1. **Vitamin C:** Orange juice is an excellent source of vitamin C, a vital antioxidant that boosts immune function and overall health.\n2. **Flavonoids:** OJ contains flavonoids, a type of polyphenol that may help protect against chronic diseases like heart disease, cancer, and cognitive decline.\n3. **Sugar content:** Orange juice can be high in natural sugars (fructose and glucose), which can be a concern for those monitoring their sugar intake.\n\n**Environmental impact:**\n\n1. **Sustainability:** Large-scale orange farming can have negative environmental impacts, such as water pollution, soil degradation, and habitat destruction.\n2. **Fair trade:** The orange juice industry is often criticized for exploiting small farmers and workers in countries like Brazil and Florida.\n\n**Cultural significance:**\n\n1. **Breakfast staple:** Orange juice has become a standard breakfast beverage in many Western cultures, particularly in the United States.\n2. **Florida's identity:** Orange juice is closely tied to Florida's agricultural heritage and economy, with the state being one of the world's largest producers.\n\n**Controversies and debates:**\n\n1. 
**Fake OJ:** The term \"fake orange juice\" refers to OJs that are not 100% freshly squeezed or contain added ingredients.\n2. **Squeeze vs. concentrate:** There is ongoing debate about whether fresh-squeezed OJ or concentrated juice with water reconstitution is better.\n\nIn conclusion, orange juice is a complex beverage with both positive and negative aspects. While it provides essential nutrients like vitamin C, its processing methods, sugar content, and environmental impact are areas of concern. As consumers, we can make informed choices by opting for high-quality, sustainably sourced OJs or exploring alternative beverages that align with our values and dietary needs.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOh, yeah. Speaking of Florida. I've heard that there was a train that carried oranges or orange juice. Do you know about that?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou're referring to the \"Orange Blossom Special\"!\n\nThe Orange Blossom Special was a nickname for the Atlantic Coast Line Railroad's (ACL) passenger train service between Jacksonville, Florida, and New York City. The train ran from 1929 to 1970 and became famous for its unique cargo: orange juice.\n\nIn the early 20th century, Florida's citrus industry was booming, and oranges were a major commodity. To transport these perishable goods efficiently and safely, ACL developed a specialized train service. The Orange Blossom Special would carry refrigerated cars filled with freshly squeezed orange juice from Florida to major cities in the Northeast.\n\nThe train's route would take it through the Appalachian Mountains, where it would stop at key stations like Washington D.C. and Philadelphia. At each stop, the train would offload its precious cargo to supply local markets. 
The journey took about 30 hours, depending on the number of stops and the weather conditions.\n\nThe Orange Blossom Special was more than just a transportation service; it became an iconic symbol of Florida's citrus industry and American culture. The train was immortalized in song by Johnny Cash, who wrote \"Orange Blossom Special\" (also known as \"The Orange Blossom Special\") in 1965. The catchy tune tells the story of a man waiting for the train at a station, reminiscing about his love of the Florida sunshine and the sweet taste of freshly squeezed OJ.\n\nAlthough the Orange Blossom Special ceased operations in 1970 due to declining passenger traffic and the rise of air transportation, its legacy lives on as a nostalgic reminder of Florida's citrus heritage.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHuh. That's pretty interesting. Speaking of, do you know what's going on with Amtrak?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=839 tid="139851958501376" timestamp=1720816853
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",2780]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=840 tid="139851958501376" timestamp=1720816853
DEBUG [update_slots] slot progression | ga_i=0 n_past=866 n_past_se=0 n_prompt_tokens_processed=898 slot_id=0 task_id=840 tid="139851958501376" timestamp=1720816853
DEBUG [update_slots] kv cache rm [p0, end) | p0=866 slot_id=0 task_id=840 tid="139851958501376" timestamp=1720816853
DEBUG [print_timings] prompt eval time     =     192.75 ms /   898 tokens (    0.21 ms per token,  4658.86 tokens per second) | n_prompt_tokens_processed=898 n_tokens_second=4658.860395017406 slot_id=0 t_prompt_processing=192.751 t_token=0.21464476614699332 task_id=840 tid="139851958501376" timestamp=1720816876
DEBUG [print_timings] generation eval time =   23009.74 ms /   433 runs   (   53.14 ms per token,    18.82 tokens per second) | n_decoded=433 n_tokens_second=18.818118710516444 slot_id=0 t_token=53.14027482678984 t_token_generation=23009.739 task_id=840 tid="139851958501376" timestamp=1720816876
DEBUG [print_timings]           total time =   23202.49 ms | slot_id=0 t_prompt_processing=192.751 t_token_generation=23009.739 t_total=23202.49 task_id=840 tid="139851958501376" timestamp=1720816876
DEBUG [update_slots] slot released | n_cache_tokens=1331 n_ctx=8192 n_past=1330 n_system_tokens=0 slot_id=0 task_id=840 tid="139851958501376" timestamp=1720816876 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=46444 status=200 tid="139851506835456" timestamp=1720816876
time=2024-07-12T20:41:16.460Z level=DEBUG source=sched.go:432 msg="context for request finished"
[GIN] 2024/07/12 - 20:41:16 | 200 | 23.448294785s |  192.168.75.195 | POST     "/api/chat"
time=2024-07-12T20:41:16.461Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-07-12T20:41:16.461Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-07-12T20:41:57.790Z level=DEBUG source=sched.go:600 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1276 tid="139851958501376" timestamp=1720816917
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1277 tid="139851958501376" timestamp=1720816917
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=44952 status=200 tid="139851496386560" timestamp=1720816917
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1278 tid="139851958501376" timestamp=1720816917
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=44952 status=200 tid="139851496386560" timestamp=1720816917
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1279 tid="139851958501376" timestamp=1720816917
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=44962 status=200 tid="139851487993856" timestamp=1720816917
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1280 tid="139851958501376" timestamp=1720816917
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=44962 status=200 tid="139851487993856" timestamp=1720816918
time=2024-07-12T20:41:58.057Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=1361 window=2048
time=2024-07-12T20:41:58.057Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOrange juice - a staple in many breakfast routines, but also a beverage that sparks controversy and debate. Here are some interesting facts and perspectives on OJ:\n\n**Quality concerns:**\n\n1. **Processing:** Most commercial orange juices are made from concentrate, which involves freezing or drying the juice before reconstituting it with water. This can affect the flavor, nutrients, and overall quality of the final product.\n2. **Additives:** Some OJs contain added sugars, preservatives, and flavor enhancers, which may not align with consumer expectations.\n\n**Nutritional aspects:**\n\n1. **Vitamin C:** Orange juice is an excellent source of vitamin C, a vital antioxidant that boosts immune function and overall health.\n2. **Flavonoids:** OJ contains flavonoids, a type of polyphenol that may help protect against chronic diseases like heart disease, cancer, and cognitive decline.\n3. **Sugar content:** Orange juice can be high in natural sugars (fructose and glucose), which can be a concern for those monitoring their sugar intake.\n\n**Environmental impact:**\n\n1. **Sustainability:** Large-scale orange farming can have negative environmental impacts, such as water pollution, soil degradation, and habitat destruction.\n2. **Fair trade:** The orange juice industry is often criticized for exploiting small farmers and workers in countries like Brazil and Florida.\n\n**Cultural significance:**\n\n1. **Breakfast staple:** Orange juice has become a standard breakfast beverage in many Western cultures, particularly in the United States.\n2. **Florida's identity:** Orange juice is closely tied to Florida's agricultural heritage and economy, with the state being one of the world's largest producers.\n\n**Controversies and debates:**\n\n1. 
**Fake OJ:** The term \"fake orange juice\" refers to OJs that are not 100% freshly squeezed or contain added ingredients.\n2. **Squeeze vs. concentrate:** There is ongoing debate about whether fresh-squeezed OJ or concentrated juice with water reconstitution is better.\n\nIn conclusion, orange juice is a complex beverage with both positive and negative aspects. While it provides essential nutrients like vitamin C, its processing methods, sugar content, and environmental impact are areas of concern. As consumers, we can make informed choices by opting for high-quality, sustainably sourced OJs or exploring alternative beverages that align with our values and dietary needs.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOh, yeah. Speaking of Florida. I've heard that there was a train that carried oranges or orange juice. Do you know about that?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou're referring to the \"Orange Blossom Special\"!\n\nThe Orange Blossom Special was a nickname for the Atlantic Coast Line Railroad's (ACL) passenger train service between Jacksonville, Florida, and New York City. The train ran from 1929 to 1970 and became famous for its unique cargo: orange juice.\n\nIn the early 20th century, Florida's citrus industry was booming, and oranges were a major commodity. To transport these perishable goods efficiently and safely, ACL developed a specialized train service. The Orange Blossom Special would carry refrigerated cars filled with freshly squeezed orange juice from Florida to major cities in the Northeast.\n\nThe train's route would take it through the Appalachian Mountains, where it would stop at key stations like Washington D.C. and Philadelphia. At each stop, the train would offload its precious cargo to supply local markets. 
The journey took about 30 hours, depending on the number of stops and the weather conditions.\n\nThe Orange Blossom Special was more than just a transportation service; it became an iconic symbol of Florida's citrus industry and American culture. The train was immortalized in song by Johnny Cash, who wrote \"Orange Blossom Special\" (also known as \"The Orange Blossom Special\") in 1965. The catchy tune tells the story of a man waiting for the train at a station, reminiscing about his love of the Florida sunshine and the sweet taste of freshly squeezed OJ.\n\nAlthough the Orange Blossom Special ceased operations in 1970 due to declining passenger traffic and the rise of air transportation, its legacy lives on as a nostalgic reminder of Florida's citrus heritage.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHuh. That's pretty interesting. Speaking of, do you know what's going on with Amtrak?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAmtrak! The national passenger railroad service in the United States has been around since 1971, and it's had its share of challenges and changes over the years.\n\nCurrently, Amtrak is facing some significant hurdles:\n\n1. **Funding:** Amtrak relies heavily on federal funding to operate its services. However, as a result of the COVID-19 pandemic, ridership numbers have dropped significantly, leading to reduced revenue and increased pressure on funding.\n2. **Infrastructure:** Much of Amtrak's infrastructure, including tracks, bridges, and stations, is aging or in need of repair. The railroad is working to upgrade its network, but this process can be slow and costly.\n3. **Competition:** With the rise of ride-hailing services, buses, and airlines, Amtrak faces increased competition for passengers' attention. To stay competitive, Amtrak has been focusing on improving service quality, expanding routes, and offering more amenities.\n4. 
**Coronavirus pandemic:** As I mentioned earlier, the pandemic has had a significant impact on Amtrak's ridership and revenue. The railroad has implemented various safety measures to reduce the risk of transmission, but this has also affected its operations.\n\nDespite these challenges, Amtrak is taking steps to modernize and improve its services:\n\n1. **New trains:** Amtrak is introducing new trainsets, such as the Acela Express and the Northeast Regional trains, which offer improved amenities, comfort, and technology.\n2. **Electrification:** Amtrak is working on electrifying some of its routes, like the Northeast Corridor (NEC), to reduce emissions and increase efficiency.\n3. **Station upgrades:** Amtrak is investing in station renovations, including modernizing facilities, improving accessibility, and enhancing passenger experiences.\n4. **Coronavirus response:** The railroad has implemented various measures to reduce the spread of COVID-19 on its trains and stations, such as increased cleaning protocols, social distancing measures, and mask mandates.\n\nAmtrak continues to play a vital role in connecting Americans across the country, and while it faces challenges, the railroad is working to adapt and improve its services for the future.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAt this point, I'm going to wait 10 minutes or so before my next response. <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1281 tid="139851958501376" timestamp=1720816918
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",4613]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=1282 tid="139851958501376" timestamp=1720816918
DEBUG [update_slots] slot progression | ga_i=0 n_past=1330 n_past_se=0 n_prompt_tokens_processed=1360 slot_id=0 task_id=1282 tid="139851958501376" timestamp=1720816918
DEBUG [update_slots] kv cache rm [p0, end) | p0=1330 slot_id=0 task_id=1282 tid="139851958501376" timestamp=1720816918
DEBUG [print_timings] prompt eval time     =     210.03 ms /  1360 tokens (    0.15 ms per token,  6475.20 tokens per second) | n_prompt_tokens_processed=1360 n_tokens_second=6475.203778471851 slot_id=0 t_prompt_processing=210.032 t_token=0.15443529411764706 task_id=1282 tid="139851958501376" timestamp=1720816922
DEBUG [print_timings] generation eval time =    4452.38 ms /    81 runs   (   54.97 ms per token,    18.19 tokens per second) | n_decoded=81 n_tokens_second=18.192509088393585 slot_id=0 t_token=54.96767901234568 t_token_generation=4452.382 task_id=1282 tid="139851958501376" timestamp=1720816922
DEBUG [print_timings]           total time =    4662.41 ms | slot_id=0 t_prompt_processing=210.032 t_token_generation=4452.382 t_total=4662.414 task_id=1282 tid="139851958501376" timestamp=1720816922
DEBUG [update_slots] slot released | n_cache_tokens=1441 n_ctx=8192 n_past=1440 n_system_tokens=0 slot_id=0 task_id=1282 tid="139851958501376" timestamp=1720816922 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=44966 status=200 tid="139851479601152" timestamp=1720816922
[GIN] 2024/07/12 - 20:42:02 | 200 |  4.952035613s |  192.168.75.195 | POST     "/api/chat"
time=2024-07-12T20:42:02.722Z level=DEBUG source=sched.go:432 msg="context for request finished"
time=2024-07-12T20:42:02.722Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-07-12T20:42:02.722Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-07-12T20:47:02.722Z level=DEBUG source=sched.go:365 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:02.722Z level=DEBUG source=sched.go:384 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:02.722Z level=DEBUG source=sched.go:400 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:02.723Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.5 GiB" now.total="23.5 GiB" now.free="19.9 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:47:03.004Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="1.5 GiB" now.used="6.5 GiB"
releasing cuda driver library
time=2024-07-12T20:47:03.005Z level=DEBUG source=server.go:1026 msg="stopping llama server"
time=2024-07-12T20:47:03.005Z level=DEBUG source=server.go:1032 msg="waiting for llama server to exit"
time=2024-07-12T20:47:03.089Z level=DEBUG source=server.go:1036 msg="llama server stopped"
time=2024-07-12T20:47:03.089Z level=DEBUG source=sched.go:405 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:03.256Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="19.9 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:47:03.406Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="1.5 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T20:47:03.406Z level=DEBUG source=sched.go:684 msg="gpu VRAM free memory converged after 0.68 seconds" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:03.406Z level=DEBUG source=sched.go:409 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:03.406Z level=DEBUG source=sched.go:332 msg="ignoring unload event with no pending requests"
time=2024-07-12T20:52:06.542Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:52:06.733Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T20:52:06.755Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T20:52:06.756Z level=DEBUG source=sched.go:251 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:52:06.756Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T20:52:06.756Z level=INFO source=sched.go:738 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda parallel=4 available=7908343808 required="6.2 GiB"
time=2024-07-12T20:52:06.756Z level=DEBUG source=server.go:98 msg="system memory" total="23.5 GiB" free=21902262272
time=2024-07-12T20:52:06.756Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T20:52:06.757Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[7.4 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=INFO source=server.go:375 msg="starting llama server" cmd="/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 35875"
time=2024-07-12T20:52:06.757Z level=DEBUG source=server.go:390 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4058211551/runners/cuda_v11:/tmp/ollama4058211551/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda]"
time=2024-07-12T20:52:06.758Z level=INFO source=sched.go:474 msg="loaded runners" count=1
time=2024-07-12T20:52:06.758Z level=INFO source=server.go:563 msg="waiting for llama runner to start responding"
time=2024-07-12T20:52:06.758Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="a8db2a9" tid="139684785954816" timestamp=1720817526
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="139684785954816" timestamp=1720817526 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="35875" tid="139684785954816" timestamp=1720817526
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-07-12T20:52:07.010Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00

OS: Docker
GPU: Nvidia
CPU: Intel
Ollama version: 0.2.1
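Since the failure only appears after the idle unload (the logs above show `OLLAMA_KEEP_ALIVE:5m0s`), a quick way to isolate the unload/reload cycle is sketched below. This is a hypothetical diagnostic, not verified on the vGPU setup described above; the container name, volume name, and model tag are placeholders.

```shell
# Diagnostic sketch (placeholders: container/volume name "ollama", model "llama3").

# 1) Run the GPU-enabled image with debug logging and idle unload disabled
#    (OLLAMA_KEEP_ALIVE=-1 keeps loaded models resident indefinitely).
#    If generation keeps working in this mode, the bug is tied to unloading.
docker run -d --gpus=all --name ollama -p 11434:11434 \
  -e OLLAMA_DEBUG=1 -e OLLAMA_KEEP_ALIVE=-1 \
  -v ollama:/root/.ollama ollama/ollama

# 2) Baseline generation request.
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'

# 3) A per-request keep_alive of 0 forces an immediate unload, so a second
#    generate call right after this one exercises the same unload -> reload
#    path without waiting out the 5-minute timer.
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "test", "keep_alive": 0, "stream": false}'
```

If step 1 avoids the failure while the forced unload in step 3 reproduces it, that points at the runner reload path rather than the initial model load.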

Originally created by @NWBx01 on GitHub (Jul 12, 2024). Original GitHub issue: https://github.com/ollama/ollama/issues/5654 ### What is the issue? Generating a response after first starting Ollama works flawlessly from what I can tell. I am able to change models and generate responses from prompts. After the model unloads due to inactivity, however, I am unable to generate any response. I use Nvidia vGPU 17.1 to passthrough my GPU to a virtual machine running the ollama docker image that has GPU capability. The CUDA compute capability are the same between the host and guest: 6.1 on the host Quadro P4000, and 6.1 on the guest GRID P40-8Q. Both also have the same amount of VRAM: 8GB on the host, and 8GB on the guest. I don't believe this would cause any issues, but I thought it would be wise to mention. Below are logs from when this happens (I've had to split this into two messages because of length): ``` 2024/07/12 20:35:32 routes.go:1033: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-07-12T20:35:32.959Z level=INFO source=images.go:751 msg="total blobs: 25" time=2024-07-12T20:35:32.960Z level=INFO source=images.go:758 msg="total unused blobs removed: 0" time=2024-07-12T20:35:32.961Z level=INFO source=routes.go:1080 msg="Listening on [::]:11434 
(version 0.2.1)" time=2024-07-12T20:35:32.961Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama4058211551/runners time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60101 file=build/linux/x86_64/rocm_v60101/bin/deps.txt.gz time=2024-07-12T20:35:32.962Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60101 file=build/linux/x86_64/rocm_v60101/bin/ollama_llama_server.gz time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server 
time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server time=2024-07-12T20:35:37.143Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60101 cpu cpu_avx]" time=2024-07-12T20:35:37.143Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY" time=2024-07-12T20:35:37.143Z level=DEBUG source=sched.go:102 msg="starting llm scheduler" time=2024-07-12T20:35:37.143Z level=INFO source=gpu.go:205 msg="looking for compatible GPUs" time=2024-07-12T20:35:37.144Z level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA" time=2024-07-12T20:35:37.144Z level=DEBUG source=gpu.go:438 msg="Searching for GPU library" name=libcuda.so* time=2024-07-12T20:35:37.144Z level=DEBUG source=gpu.go:457 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]" time=2024-07-12T20:35:37.150Z level=DEBUG source=gpu.go:491 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.550.54.15] CUDA driver version: 12.4 time=2024-07-12T20:35:37.202Z level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.54.15 [GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda] CUDA totalMem 8192 mb [GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda] CUDA freeMem 7541 mb [GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda] Compute Capability 6.1 time=2024-07-12T20:35:37.364Z 
level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu" releasing cuda driver library time=2024-07-12T20:35:37.364Z level=INFO source=types.go:103 msg="inference compute" id=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda library=cuda compute=6.1 driver=12.4 name="GRID P40-8Q" total="8.0 GiB" available="7.4 GiB" [GIN] 2024/07/12 - 20:36:17 | 200 | 1.769258ms | 192.168.75.195 | GET "/api/tags" [GIN] 2024/07/12 - 20:36:17 | 200 | 94.14µs | 192.168.75.195 | GET "/api/version" time=2024-07-12T20:38:26.695Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.5 GiB" now.total="23.5 GiB" now.free="20.5 GiB" CUDA driver version: 12.4 time=2024-07-12T20:38:26.899Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB" releasing cuda driver library time=2024-07-12T20:38:26.899Z level=DEBUG source=sched.go:182 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1 time=2024-07-12T20:38:26.923Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]" time=2024-07-12T20:38:26.924Z level=DEBUG source=sched.go:251 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-12T20:38:26.924Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]" time=2024-07-12T20:38:26.924Z level=INFO source=sched.go:738 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda parallel=4 available=7908343808 required="6.2 GiB" time=2024-07-12T20:38:26.924Z level=DEBUG source=server.go:98 msg="system memory" 
total="23.5 GiB" free=21974278144 time=2024-07-12T20:38:26.924Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]" time=2024-07-12T20:38:26.925Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[7.4 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB" time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server time=2024-07-12T20:38:26.925Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" 
file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server time=2024-07-12T20:38:26.926Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server time=2024-07-12T20:38:26.926Z level=INFO source=server.go:375 msg="starting llama server" cmd="/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 44335" time=2024-07-12T20:38:26.926Z level=DEBUG source=server.go:390 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4058211551/runners/cuda_v11:/tmp/ollama4058211551/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda]" time=2024-07-12T20:38:26.927Z level=INFO source=sched.go:474 msg="loaded runners" count=1 time=2024-07-12T20:38:26.927Z level=INFO source=server.go:563 msg="waiting for llama runner to start responding" time=2024-07-12T20:38:26.927Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=1 commit="a8db2a9" tid="139851958501376" timestamp=1720816706 INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="139851958501376" timestamp=1720816706 total_threads=8 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="44335" tid="139851958501376" timestamp=1720816706 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from 
/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-07-12T20:38:27.179Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 
llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: GRID P40-8Q, compute capability 6.1, VMM: no llm_load_tensors: ggml ctx size = 0.27 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 281.81 MiB llm_load_tensors: CUDA0 buffer size = 4155.99 MiB time=2024-07-12T20:38:28.184Z level=DEBUG source=server.go:615 msg="model load progress 0.18" time=2024-07-12T20:38:28.436Z level=DEBUG source=server.go:615 msg="model load progress 0.64" time=2024-07-12T20:38:28.686Z level=DEBUG source=server.go:615 msg="model load progress 0.99" llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 time=2024-07-12T20:38:28.938Z level=DEBUG source=server.go:615 msg="model load progress 1.00" 
time=2024-07-12T20:38:29.189Z level=DEBUG source=server.go:618 msg="model load completed, waiting for server to become available" status="llm server loading model" DEBUG [initialize] initializing slots | n_slots=4 tid="139851958501376" timestamp=1720816709 DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="139851958501376" timestamp=1720816709 DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="139851958501376" timestamp=1720816709 DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="139851958501376" timestamp=1720816709 DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="139851958501376" timestamp=1720816709 INFO [main] model loaded | tid="139851958501376" timestamp=1720816709 DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="139851958501376" timestamp=1720816709 DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="139851958501376" timestamp=1720816709 time=2024-07-12T20:38:29.440Z level=INFO source=server.go:609 msg="llama runner started in 2.51 seconds" time=2024-07-12T20:38:29.440Z level=DEBUG source=sched.go:487 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="139851958501376" timestamp=1720816709 DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=39120 status=200 tid="139851487993856" timestamp=1720816709 time=2024-07-12T20:38:29.485Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=19 window=2048 time=2024-07-12T20:38:29.485Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0 DEBUG [process_single_task] slot data | 
n_idle_slots=4 n_processing_slots=0 task_id=2 tid="139851958501376" timestamp=1720816709 DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="139851958501376" timestamp=1720816709 DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=18 slot_id=0 task_id=3 tid="139851958501376" timestamp=1720816709 DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="139851958501376" timestamp=1720816709 DEBUG [print_timings] prompt eval time = 123.72 ms / 18 tokens ( 6.87 ms per token, 145.49 tokens per second) | n_prompt_tokens_processed=18 n_tokens_second=145.48511202353626 slot_id=0 t_prompt_processing=123.724 t_token=6.873555555555556 task_id=3 tid="139851958501376" timestamp=1720816731 DEBUG [print_timings] generation eval time = 21845.44 ms / 482 runs ( 45.32 ms per token, 22.06 tokens per second) | n_decoded=482 n_tokens_second=22.064097209468482 slot_id=0 t_token=45.322497925311204 t_token_generation=21845.444 task_id=3 tid="139851958501376" timestamp=1720816731 DEBUG [print_timings] total time = 21969.17 ms | slot_id=0 t_prompt_processing=123.724 t_token_generation=21845.444 t_total=21969.167999999998 task_id=3 tid="139851958501376" timestamp=1720816731 DEBUG [update_slots] slot released | n_cache_tokens=500 n_ctx=8192 n_past=499 n_system_tokens=0 slot_id=0 task_id=3 tid="139851958501376" timestamp=1720816731 truncated=false DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=39120 status=200 tid="139851487993856" timestamp=1720816731 [GIN] 2024/07/12 - 20:38:51 | 200 | 24.824350507s | 192.168.75.195 | POST "/api/chat" time=2024-07-12T20:38:51.499Z level=DEBUG source=sched.go:491 msg="context for request finished" time=2024-07-12T20:38:51.499Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" 
modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s time=2024-07-12T20:38:51.499Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0 time=2024-07-12T20:38:51.606Z level=DEBUG source=sched.go:600 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=488 tid="139851958501376" timestamp=1720816731 DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=489 tid="139851958501376" timestamp=1720816731 DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=50248 status=200 tid="139851479601152" timestamp=1720816731 time=2024-07-12T20:38:51.653Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=92 window=2048 time=2024-07-12T20:38:51.653Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nHere is the query:\nWhat's the deal with orange juice?\n\nCreate a concise, 3-5 word phrase as a title for the previous query. Avoid quotation marks or special formatting. 
RESPOND ONLY WITH THE TITLE TEXT.\n\nExamples of titles:\nStock Market Trends\nPerfect Chocolate Chip Recipe\nEvolution of Music Streaming\nRemote Work Productivity Tips\nArtificial Intelligence in Healthcare\nVideo Game Development Insights<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0 DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=490 tid="139851958501376" timestamp=1720816731 DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",42] DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=491 tid="139851958501376" timestamp=1720816731 DEBUG [update_slots] slot progression | ga_i=0 n_past=5 n_past_se=0 n_prompt_tokens_processed=91 slot_id=0 task_id=491 tid="139851958501376" timestamp=1720816731 DEBUG [update_slots] kv cache rm [p0, end) | p0=5 slot_id=0 task_id=491 tid="139851958501376" timestamp=1720816731 DEBUG [print_timings] prompt eval time = 300.25 ms / 91 tokens ( 3.30 ms per token, 303.08 tokens per second) | n_prompt_tokens_processed=91 n_tokens_second=303.0797566036416 slot_id=0 t_prompt_processing=300.251 t_token=3.299461538461538 task_id=491 tid="139851958501376" timestamp=1720816732 DEBUG [print_timings] generation eval time = 175.14 ms / 5 runs ( 35.03 ms per token, 28.55 tokens per second) | n_decoded=5 n_tokens_second=28.548589699668838 slot_id=0 t_token=35.028 t_token_generation=175.14 task_id=491 tid="139851958501376" timestamp=1720816732 DEBUG [print_timings] total time = 475.39 ms | slot_id=0 t_prompt_processing=300.251 t_token_generation=175.14 t_total=475.39099999999996 task_id=491 tid="139851958501376" timestamp=1720816732 DEBUG [update_slots] slot released | n_cache_tokens=96 n_ctx=8192 n_past=95 n_system_tokens=0 slot_id=0 task_id=491 tid="139851958501376" timestamp=1720816732 truncated=false DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=50248 status=200 
tid="139851479601152" timestamp=1720816732 [GIN] 2024/07/12 - 20:38:52 | 200 | 590.647083ms | 192.168.75.195 | POST "/v1/chat/completions" time=2024-07-12T20:38:52.178Z level=DEBUG source=sched.go:432 msg="context for request finished" time=2024-07-12T20:38:52.179Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s time=2024-07-12T20:38:52.179Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0 time=2024-07-12T20:39:23.060Z level=DEBUG source=sched.go:600 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=499 tid="139851958501376" timestamp=1720816763 DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=500 tid="139851958501376" timestamp=1720816763 DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=55306 status=200 tid="139851471208448" timestamp=1720816763 DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=501 tid="139851958501376" timestamp=1720816763 DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=55306 status=200 tid="139851471208448" timestamp=1720816763 time=2024-07-12T20:39:23.194Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=539 window=2048 time=2024-07-12T20:39:23.194Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange 
juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOrange juice - a staple in many breakfast routines, but also a beverage that sparks controversy and debate. Here are some interesting facts and perspectives on OJ:\n\n**Quality concerns:**\n\n1. **Processing:** Most commercial orange juices are made from concentrate, which involves freezing or drying the juice before reconstituting it with water. This can affect the flavor, nutrients, and overall quality of the final product.\n2. **Additives:** Some OJs contain added sugars, preservatives, and flavor enhancers, which may not align with consumer expectations.\n\n**Nutritional aspects:**\n\n1. **Vitamin C:** Orange juice is an excellent source of vitamin C, a vital antioxidant that boosts immune function and overall health.\n2. **Flavonoids:** OJ contains flavonoids, a type of polyphenol that may help protect against chronic diseases like heart disease, cancer, and cognitive decline.\n3. **Sugar content:** Orange juice can be high in natural sugars (fructose and glucose), which can be a concern for those monitoring their sugar intake.\n\n**Environmental impact:**\n\n1. **Sustainability:** Large-scale orange farming can have negative environmental impacts, such as water pollution, soil degradation, and habitat destruction.\n2. **Fair trade:** The orange juice industry is often criticized for exploiting small farmers and workers in countries like Brazil and Florida.\n\n**Cultural significance:**\n\n1. **Breakfast staple:** Orange juice has become a standard breakfast beverage in many Western cultures, particularly in the United States.\n2. **Florida's identity:** Orange juice is closely tied to Florida's agricultural heritage and economy, with the state being one of the world's largest producers.\n\n**Controversies and debates:**\n\n1. **Fake OJ:** The term \"fake orange juice\" refers to OJs that are not 100% freshly squeezed or contain added ingredients.\n2. **Squeeze vs. 
concentrate:** There is ongoing debate about whether fresh-squeezed OJ or concentrated juice with water reconstitution is better.\n\nIn conclusion, orange juice is a complex beverage with both positive and negative aspects. While it provides essential nutrients like vitamin C, its processing methods, sugar content, and environmental impact are areas of concern. As consumers, we can make informed choices by opting for high-quality, sustainably sourced OJs or exploring alternative beverages that align with our values and dietary needs.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOh, yeah. Speaking of Florida. I've heard that there was a train that carried oranges or orange juice. Do you know about that?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=502 tid="139851958501376" timestamp=1720816763
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",42]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=503 tid="139851958501376" timestamp=1720816763
DEBUG [update_slots] slot progression | ga_i=0 n_past=5 n_past_se=0 n_prompt_tokens_processed=538 slot_id=0 task_id=503 tid="139851958501376" timestamp=1720816763
DEBUG [update_slots] kv cache rm [p0, end) | p0=5 slot_id=0 task_id=503 tid="139851958501376" timestamp=1720816763
DEBUG [print_timings] prompt eval time = 1492.37 ms / 538 tokens ( 2.77 ms per token, 360.50 tokens per second) | n_prompt_tokens_processed=538 n_tokens_second=360.5006536587131 slot_id=0 t_prompt_processing=1492.369 t_token=2.773920074349442 task_id=503 tid="139851958501376" timestamp=1720816781
DEBUG [print_timings] generation eval time = 16369.56 ms / 329 runs ( 49.76 ms per token, 20.10 tokens per second) | n_decoded=329 n_tokens_second=20.09827875041976 slot_id=0 t_token=49.75550455927051 t_token_generation=16369.561 task_id=503 tid="139851958501376" timestamp=1720816781
DEBUG [print_timings] total time = 17861.93 ms | slot_id=0 t_prompt_processing=1492.369 t_token_generation=16369.561 t_total=17861.93 task_id=503 tid="139851958501376" timestamp=1720816781
DEBUG [update_slots] slot released | n_cache_tokens=867 n_ctx=8192 n_past=866 n_system_tokens=0 slot_id=0 task_id=503 tid="139851958501376" timestamp=1720816781 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=55310 status=200 tid="139851462815744" timestamp=1720816781
[GIN] 2024/07/12 - 20:39:41 | 200 | 18.015866775s | 192.168.75.195 | POST "/api/chat"
time=2024-07-12T20:39:41.058Z level=DEBUG source=sched.go:432 msg="context for request finished"
time=2024-07-12T20:39:41.058Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-07-12T20:39:41.058Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-07-12T20:40:53.034Z level=DEBUG source=sched.go:600 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=835 tid="139851958501376" timestamp=1720816853
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=836 tid="139851958501376" timestamp=1720816853
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=46440 status=200 tid="139851382910976" timestamp=1720816853
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=837 tid="139851958501376" timestamp=1720816853
DEBUG [log_server_request] request |
method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=46440 status=200 tid="139851382910976" timestamp=1720816853
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=838 tid="139851958501376" timestamp=1720816853
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=46444 status=200 tid="139851506835456" timestamp=1720816853
time=2024-07-12T20:40:53.209Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=899 window=2048
time=2024-07-12T20:40:53.209Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOrange juice - a staple in many breakfast routines, but also a beverage that sparks controversy and debate. Here are some interesting facts and perspectives on OJ:\n\n**Quality concerns:**\n\n1. **Processing:** Most commercial orange juices are made from concentrate, which involves freezing or drying the juice before reconstituting it with water. This can affect the flavor, nutrients, and overall quality of the final product.\n2. **Additives:** Some OJs contain added sugars, preservatives, and flavor enhancers, which may not align with consumer expectations.\n\n**Nutritional aspects:**\n\n1. **Vitamin C:** Orange juice is an excellent source of vitamin C, a vital antioxidant that boosts immune function and overall health.\n2. **Flavonoids:** OJ contains flavonoids, a type of polyphenol that may help protect against chronic diseases like heart disease, cancer, and cognitive decline.\n3. **Sugar content:** Orange juice can be high in natural sugars (fructose and glucose), which can be a concern for those monitoring their sugar intake.\n\n**Environmental impact:**\n\n1.
**Sustainability:** Large-scale orange farming can have negative environmental impacts, such as water pollution, soil degradation, and habitat destruction.\n2. **Fair trade:** The orange juice industry is often criticized for exploiting small farmers and workers in countries like Brazil and Florida.\n\n**Cultural significance:**\n\n1. **Breakfast staple:** Orange juice has become a standard breakfast beverage in many Western cultures, particularly in the United States.\n2. **Florida's identity:** Orange juice is closely tied to Florida's agricultural heritage and economy, with the state being one of the world's largest producers.\n\n**Controversies and debates:**\n\n1. **Fake OJ:** The term \"fake orange juice\" refers to OJs that are not 100% freshly squeezed or contain added ingredients.\n2. **Squeeze vs. concentrate:** There is ongoing debate about whether fresh-squeezed OJ or concentrated juice with water reconstitution is better.\n\nIn conclusion, orange juice is a complex beverage with both positive and negative aspects. While it provides essential nutrients like vitamin C, its processing methods, sugar content, and environmental impact are areas of concern. As consumers, we can make informed choices by opting for high-quality, sustainably sourced OJs or exploring alternative beverages that align with our values and dietary needs.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOh, yeah. Speaking of Florida. I've heard that there was a train that carried oranges or orange juice. Do you know about that?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou're referring to the \"Orange Blossom Special\"!\n\nThe Orange Blossom Special was a nickname for the Atlantic Coast Line Railroad's (ACL) passenger train service between Jacksonville, Florida, and New York City. 
The train ran from 1929 to 1970 and became famous for its unique cargo: orange juice.\n\nIn the early 20th century, Florida's citrus industry was booming, and oranges were a major commodity. To transport these perishable goods efficiently and safely, ACL developed a specialized train service. The Orange Blossom Special would carry refrigerated cars filled with freshly squeezed orange juice from Florida to major cities in the Northeast.\n\nThe train's route would take it through the Appalachian Mountains, where it would stop at key stations like Washington D.C. and Philadelphia. At each stop, the train would offload its precious cargo to supply local markets. The journey took about 30 hours, depending on the number of stops and the weather conditions.\n\nThe Orange Blossom Special was more than just a transportation service; it became an iconic symbol of Florida's citrus industry and American culture. The train was immortalized in song by Johnny Cash, who wrote \"Orange Blossom Special\" (also known as \"The Orange Blossom Special\") in 1965. The catchy tune tells the story of a man waiting for the train at a station, reminiscing about his love of the Florida sunshine and the sweet taste of freshly squeezed OJ.\n\nAlthough the Orange Blossom Special ceased operations in 1970 due to declining passenger traffic and the rise of air transportation, its legacy lives on as a nostalgic reminder of Florida's citrus heritage.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHuh. That's pretty interesting. 
Speaking of, do you know what's going on with Amtrak?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=839 tid="139851958501376" timestamp=1720816853
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",2780]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=840 tid="139851958501376" timestamp=1720816853
DEBUG [update_slots] slot progression | ga_i=0 n_past=866 n_past_se=0 n_prompt_tokens_processed=898 slot_id=0 task_id=840 tid="139851958501376" timestamp=1720816853
DEBUG [update_slots] kv cache rm [p0, end) | p0=866 slot_id=0 task_id=840 tid="139851958501376" timestamp=1720816853
DEBUG [print_timings] prompt eval time = 192.75 ms / 898 tokens ( 0.21 ms per token, 4658.86 tokens per second) | n_prompt_tokens_processed=898 n_tokens_second=4658.860395017406 slot_id=0 t_prompt_processing=192.751 t_token=0.21464476614699332 task_id=840 tid="139851958501376" timestamp=1720816876
DEBUG [print_timings] generation eval time = 23009.74 ms / 433 runs ( 53.14 ms per token, 18.82 tokens per second) | n_decoded=433 n_tokens_second=18.818118710516444 slot_id=0 t_token=53.14027482678984 t_token_generation=23009.739 task_id=840 tid="139851958501376" timestamp=1720816876
DEBUG [print_timings] total time = 23202.49 ms | slot_id=0 t_prompt_processing=192.751 t_token_generation=23009.739 t_total=23202.49 task_id=840 tid="139851958501376" timestamp=1720816876
DEBUG [update_slots] slot released | n_cache_tokens=1331 n_ctx=8192 n_past=1330 n_system_tokens=0 slot_id=0 task_id=840 tid="139851958501376" timestamp=1720816876 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=46444 status=200 tid="139851506835456" timestamp=1720816876
time=2024-07-12T20:41:16.460Z level=DEBUG source=sched.go:432 msg="context for request finished"
[GIN] 2024/07/12 - 20:41:16 | 200 | 23.448294785s | 192.168.75.195 | POST "/api/chat"
time=2024-07-12T20:41:16.461Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-07-12T20:41:16.461Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-07-12T20:41:57.790Z level=DEBUG source=sched.go:600 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1276 tid="139851958501376" timestamp=1720816917
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1277 tid="139851958501376" timestamp=1720816917
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=44952 status=200 tid="139851496386560" timestamp=1720816917
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1278 tid="139851958501376" timestamp=1720816917
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=44952 status=200 tid="139851496386560" timestamp=1720816917
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1279 tid="139851958501376" timestamp=1720816917
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=44962 status=200 tid="139851487993856" timestamp=1720816917
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1280 tid="139851958501376" timestamp=1720816917
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize"
remote_addr="127.0.0.1" remote_port=44962 status=200 tid="139851487993856" timestamp=1720816918
time=2024-07-12T20:41:58.057Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=1361 window=2048
time=2024-07-12T20:41:58.057Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOrange juice - a staple in many breakfast routines, but also a beverage that sparks controversy and debate. Here are some interesting facts and perspectives on OJ:\n\n**Quality concerns:**\n\n1. **Processing:** Most commercial orange juices are made from concentrate, which involves freezing or drying the juice before reconstituting it with water. This can affect the flavor, nutrients, and overall quality of the final product.\n2. **Additives:** Some OJs contain added sugars, preservatives, and flavor enhancers, which may not align with consumer expectations.\n\n**Nutritional aspects:**\n\n1. **Vitamin C:** Orange juice is an excellent source of vitamin C, a vital antioxidant that boosts immune function and overall health.\n2. **Flavonoids:** OJ contains flavonoids, a type of polyphenol that may help protect against chronic diseases like heart disease, cancer, and cognitive decline.\n3. **Sugar content:** Orange juice can be high in natural sugars (fructose and glucose), which can be a concern for those monitoring their sugar intake.\n\n**Environmental impact:**\n\n1. **Sustainability:** Large-scale orange farming can have negative environmental impacts, such as water pollution, soil degradation, and habitat destruction.\n2. **Fair trade:** The orange juice industry is often criticized for exploiting small farmers and workers in countries like Brazil and Florida.\n\n**Cultural significance:**\n\n1.
**Breakfast staple:** Orange juice has become a standard breakfast beverage in many Western cultures, particularly in the United States.\n2. **Florida's identity:** Orange juice is closely tied to Florida's agricultural heritage and economy, with the state being one of the world's largest producers.\n\n**Controversies and debates:**\n\n1. **Fake OJ:** The term \"fake orange juice\" refers to OJs that are not 100% freshly squeezed or contain added ingredients.\n2. **Squeeze vs. concentrate:** There is ongoing debate about whether fresh-squeezed OJ or concentrated juice with water reconstitution is better.\n\nIn conclusion, orange juice is a complex beverage with both positive and negative aspects. While it provides essential nutrients like vitamin C, its processing methods, sugar content, and environmental impact are areas of concern. As consumers, we can make informed choices by opting for high-quality, sustainably sourced OJs or exploring alternative beverages that align with our values and dietary needs.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOh, yeah. Speaking of Florida. I've heard that there was a train that carried oranges or orange juice. Do you know about that?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou're referring to the \"Orange Blossom Special\"!\n\nThe Orange Blossom Special was a nickname for the Atlantic Coast Line Railroad's (ACL) passenger train service between Jacksonville, Florida, and New York City. The train ran from 1929 to 1970 and became famous for its unique cargo: orange juice.\n\nIn the early 20th century, Florida's citrus industry was booming, and oranges were a major commodity. To transport these perishable goods efficiently and safely, ACL developed a specialized train service. 
The Orange Blossom Special would carry refrigerated cars filled with freshly squeezed orange juice from Florida to major cities in the Northeast.\n\nThe train's route would take it through the Appalachian Mountains, where it would stop at key stations like Washington D.C. and Philadelphia. At each stop, the train would offload its precious cargo to supply local markets. The journey took about 30 hours, depending on the number of stops and the weather conditions.\n\nThe Orange Blossom Special was more than just a transportation service; it became an iconic symbol of Florida's citrus industry and American culture. The train was immortalized in song by Johnny Cash, who wrote \"Orange Blossom Special\" (also known as \"The Orange Blossom Special\") in 1965. The catchy tune tells the story of a man waiting for the train at a station, reminiscing about his love of the Florida sunshine and the sweet taste of freshly squeezed OJ.\n\nAlthough the Orange Blossom Special ceased operations in 1970 due to declining passenger traffic and the rise of air transportation, its legacy lives on as a nostalgic reminder of Florida's citrus heritage.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHuh. That's pretty interesting. Speaking of, do you know what's going on with Amtrak?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAmtrak! The national passenger railroad service in the United States has been around since 1971, and it's had its share of challenges and changes over the years.\n\nCurrently, Amtrak is facing some significant hurdles:\n\n1. **Funding:** Amtrak relies heavily on federal funding to operate its services. However, as a result of the COVID-19 pandemic, ridership numbers have dropped significantly, leading to reduced revenue and increased pressure on funding.\n2. **Infrastructure:** Much of Amtrak's infrastructure, including tracks, bridges, and stations, is aging or in need of repair. 
The railroad is working to upgrade its network, but this process can be slow and costly.\n3. **Competition:** With the rise of ride-hailing services, buses, and airlines, Amtrak faces increased competition for passengers' attention. To stay competitive, Amtrak has been focusing on improving service quality, expanding routes, and offering more amenities.\n4. **Coronavirus pandemic:** As I mentioned earlier, the pandemic has had a significant impact on Amtrak's ridership and revenue. The railroad has implemented various safety measures to reduce the risk of transmission, but this has also affected its operations.\n\nDespite these challenges, Amtrak is taking steps to modernize and improve its services:\n\n1. **New trains:** Amtrak is introducing new trainsets, such as the Acela Express and the Northeast Regional trains, which offer improved amenities, comfort, and technology.\n2. **Electrification:** Amtrak is working on electrifying some of its routes, like the Northeast Corridor (NEC), to reduce emissions and increase efficiency.\n3. **Station upgrades:** Amtrak is investing in station renovations, including modernizing facilities, improving accessibility, and enhancing passenger experiences.\n4. **Coronavirus response:** The railroad has implemented various measures to reduce the spread of COVID-19 on its trains and stations, such as increased cleaning protocols, social distancing measures, and mask mandates.\n\nAmtrak continues to play a vital role in connecting Americans across the country, and while it faces challenges, the railroad is working to adapt and improve its services for the future.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAt this point, I'm going to wait 10 minutes or so before my next response. 
<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1281 tid="139851958501376" timestamp=1720816918
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",4613]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=1282 tid="139851958501376" timestamp=1720816918
DEBUG [update_slots] slot progression | ga_i=0 n_past=1330 n_past_se=0 n_prompt_tokens_processed=1360 slot_id=0 task_id=1282 tid="139851958501376" timestamp=1720816918
DEBUG [update_slots] kv cache rm [p0, end) | p0=1330 slot_id=0 task_id=1282 tid="139851958501376" timestamp=1720816918
DEBUG [print_timings] prompt eval time = 210.03 ms / 1360 tokens ( 0.15 ms per token, 6475.20 tokens per second) | n_prompt_tokens_processed=1360 n_tokens_second=6475.203778471851 slot_id=0 t_prompt_processing=210.032 t_token=0.15443529411764706 task_id=1282 tid="139851958501376" timestamp=1720816922
DEBUG [print_timings] generation eval time = 4452.38 ms / 81 runs ( 54.97 ms per token, 18.19 tokens per second) | n_decoded=81 n_tokens_second=18.192509088393585 slot_id=0 t_token=54.96767901234568 t_token_generation=4452.382 task_id=1282 tid="139851958501376" timestamp=1720816922
DEBUG [print_timings] total time = 4662.41 ms | slot_id=0 t_prompt_processing=210.032 t_token_generation=4452.382 t_total=4662.414 task_id=1282 tid="139851958501376" timestamp=1720816922
DEBUG [update_slots] slot released | n_cache_tokens=1441 n_ctx=8192 n_past=1440 n_system_tokens=0 slot_id=0 task_id=1282 tid="139851958501376" timestamp=1720816922 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=44966 status=200 tid="139851479601152" timestamp=1720816922
[GIN] 2024/07/12 - 20:42:02 | 200 | 4.952035613s | 192.168.75.195 | POST "/api/chat"
time=2024-07-12T20:42:02.722Z level=DEBUG source=sched.go:432 msg="context for request finished"
time=2024-07-12T20:42:02.722Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-07-12T20:42:02.722Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-07-12T20:47:02.722Z level=DEBUG source=sched.go:365 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:02.722Z level=DEBUG source=sched.go:384 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:02.722Z level=DEBUG source=sched.go:400 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:02.723Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.5 GiB" now.total="23.5 GiB" now.free="19.9 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:47:03.004Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="1.5 GiB" now.used="6.5 GiB"
releasing cuda driver library
time=2024-07-12T20:47:03.005Z level=DEBUG source=server.go:1026 msg="stopping llama server"
time=2024-07-12T20:47:03.005Z level=DEBUG source=server.go:1032 msg="waiting for llama server to exit"
time=2024-07-12T20:47:03.089Z level=DEBUG source=server.go:1036 msg="llama server stopped"
time=2024-07-12T20:47:03.089Z level=DEBUG source=sched.go:405 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:03.256Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="19.9 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:47:03.406Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="1.5 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T20:47:03.406Z level=DEBUG source=sched.go:684 msg="gpu VRAM free memory converged after 0.68 seconds" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:03.406Z level=DEBUG source=sched.go:409 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:47:03.406Z level=DEBUG source=sched.go:332 msg="ignoring unload event with no pending requests"
time=2024-07-12T20:52:06.542Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:52:06.733Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T20:52:06.755Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T20:52:06.756Z level=DEBUG source=sched.go:251 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:52:06.756Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T20:52:06.756Z level=INFO source=sched.go:738 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda parallel=4 available=7908343808 required="6.2 GiB"
time=2024-07-12T20:52:06.756Z level=DEBUG source=server.go:98 msg="system memory" total="23.5 GiB" free=21902262272
time=2024-07-12T20:52:06.756Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T20:52:06.757Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[7.4 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T20:52:06.757Z level=INFO source=server.go:375 msg="starting llama server" cmd="/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 35875"
time=2024-07-12T20:52:06.757Z level=DEBUG source=server.go:390 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4058211551/runners/cuda_v11:/tmp/ollama4058211551/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda]"
time=2024-07-12T20:52:06.758Z level=INFO source=sched.go:474 msg="loaded runners" count=1
time=2024-07-12T20:52:06.758Z level=INFO source=server.go:563 msg="waiting for llama runner to start responding"
time=2024-07-12T20:52:06.758Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="a8db2a9" tid="139684785954816" timestamp=1720817526
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="139684785954816" timestamp=1720817526 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="35875" tid="139684785954816" timestamp=1720817526
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-07-12T20:52:07.010Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
```

### OS

Docker

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.2.1
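The failure can be driven without a chat UI. Below is a minimal, hypothetical reproduction sketch (not part of the original report): it assumes the default `/api/chat` endpoint on `localhost:11434` and the 5-minute `OLLAMA_KEEP_ALIVE` shown in the server config above; the model name `llama3` is illustrative.

```python
# Hypothetical repro sketch: one request while the model is loaded, then a
# second request after the 5m keep_alive window has expired and the model
# has been unloaded. Per this report, the second request never completes.
import json
import time
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"  # default Ollama endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming /api/chat request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one chat turn and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["message"]["content"]


# Usage against a live server (commented out so the module imports cleanly):
#   chat("llama3", "ping")        # succeeds while the model is loaded
#   time.sleep(6 * 60)            # wait past the 5m keep_alive; model unloads
#   chat("llama3", "ping again")  # this post-unload request is the one that hangs
```

If the second call times out while the logs show the runner stuck at "waiting for server to become available", that matches the behavior described here.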
GiteaMirror added the bug, needs more info labels 2026-05-03 21:42:15 -05:00

@NWBx01 commented on GitHub (Jul 12, 2024):

Log continues:

llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: GRID P40-8Q, compute capability 6.1, VMM: no
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      CUDA0 buffer size =  4155.99 MiB
time=2024-07-12T20:52:08.015Z level=DEBUG source=server.go:615 msg="model load progress 0.19"
time=2024-07-12T20:52:08.266Z level=DEBUG source=server.go:615 msg="model load progress 0.71"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-07-12T20:52:08.517Z level=DEBUG source=server.go:615 msg="model load progress 1.00"
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
time=2024-07-12T20:52:08.767Z level=DEBUG source=server.go:618 msg="model load completed, waiting for server to become available" status="llm server loading model"
DEBUG [initialize] initializing slots | n_slots=4 tid="139684785954816" timestamp=1720817529
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="139684785954816" timestamp=1720817529
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="139684785954816" timestamp=1720817529
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="139684785954816" timestamp=1720817529
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="139684785954816" timestamp=1720817529
INFO [main] model loaded | tid="139684785954816" timestamp=1720817529
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="139684785954816" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="139684785954816" timestamp=1720817529
time=2024-07-12T20:52:09.270Z level=INFO source=server.go:609 msg="llama runner started in 2.51 seconds"
time=2024-07-12T20:52:09.270Z level=DEBUG source=sched.go:487 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38022 status=200 tid="139684315619328" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38022 status=200 tid="139684315619328" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=3 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38032 status=200 tid="139684307226624" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=4 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38032 status=200 tid="139684307226624" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=5 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38032 status=200 tid="139684307226624" timestamp=1720817529
time=2024-07-12T20:52:09.537Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=1466 window=2048
time=2024-07-12T20:52:09.537Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOrange juice - a staple in many breakfast routines, but also a beverage that sparks controversy and debate. Here are some interesting facts and perspectives on OJ:\n\n**Quality concerns:**\n\n1. **Processing:** Most commercial orange juices are made from concentrate, which involves freezing or drying the juice before reconstituting it with water. This can affect the flavor, nutrients, and overall quality of the final product.\n2. **Additives:** Some OJs contain added sugars, preservatives, and flavor enhancers, which may not align with consumer expectations.\n\n**Nutritional aspects:**\n\n1. **Vitamin C:** Orange juice is an excellent source of vitamin C, a vital antioxidant that boosts immune function and overall health.\n2. **Flavonoids:** OJ contains flavonoids, a type of polyphenol that may help protect against chronic diseases like heart disease, cancer, and cognitive decline.\n3. **Sugar content:** Orange juice can be high in natural sugars (fructose and glucose), which can be a concern for those monitoring their sugar intake.\n\n**Environmental impact:**\n\n1. **Sustainability:** Large-scale orange farming can have negative environmental impacts, such as water pollution, soil degradation, and habitat destruction.\n2. **Fair trade:** The orange juice industry is often criticized for exploiting small farmers and workers in countries like Brazil and Florida.\n\n**Cultural significance:**\n\n1. **Breakfast staple:** Orange juice has become a standard breakfast beverage in many Western cultures, particularly in the United States.\n2. **Florida's identity:** Orange juice is closely tied to Florida's agricultural heritage and economy, with the state being one of the world's largest producers.\n\n**Controversies and debates:**\n\n1. 
**Fake OJ:** The term \"fake orange juice\" refers to OJs that are not 100% freshly squeezed or contain added ingredients.\n2. **Squeeze vs. concentrate:** There is ongoing debate about whether fresh-squeezed OJ or concentrated juice with water reconstitution is better.\n\nIn conclusion, orange juice is a complex beverage with both positive and negative aspects. While it provides essential nutrients like vitamin C, its processing methods, sugar content, and environmental impact are areas of concern. As consumers, we can make informed choices by opting for high-quality, sustainably sourced OJs or exploring alternative beverages that align with our values and dietary needs.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOh, yeah. Speaking of Florida. I've heard that there was a train that carried oranges or orange juice. Do you know about that?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou're referring to the \"Orange Blossom Special\"!\n\nThe Orange Blossom Special was a nickname for the Atlantic Coast Line Railroad's (ACL) passenger train service between Jacksonville, Florida, and New York City. The train ran from 1929 to 1970 and became famous for its unique cargo: orange juice.\n\nIn the early 20th century, Florida's citrus industry was booming, and oranges were a major commodity. To transport these perishable goods efficiently and safely, ACL developed a specialized train service. The Orange Blossom Special would carry refrigerated cars filled with freshly squeezed orange juice from Florida to major cities in the Northeast.\n\nThe train's route would take it through the Appalachian Mountains, where it would stop at key stations like Washington D.C. and Philadelphia. At each stop, the train would offload its precious cargo to supply local markets. 
The journey took about 30 hours, depending on the number of stops and the weather conditions.\n\nThe Orange Blossom Special was more than just a transportation service; it became an iconic symbol of Florida's citrus industry and American culture. The train was immortalized in song by Johnny Cash, who wrote \"Orange Blossom Special\" (also known as \"The Orange Blossom Special\") in 1965. The catchy tune tells the story of a man waiting for the train at a station, reminiscing about his love of the Florida sunshine and the sweet taste of freshly squeezed OJ.\n\nAlthough the Orange Blossom Special ceased operations in 1970 due to declining passenger traffic and the rise of air transportation, its legacy lives on as a nostalgic reminder of Florida's citrus heritage.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHuh. That's pretty interesting. Speaking of, do you know what's going on with Amtrak?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAmtrak! The national passenger railroad service in the United States has been around since 1971, and it's had its share of challenges and changes over the years.\n\nCurrently, Amtrak is facing some significant hurdles:\n\n1. **Funding:** Amtrak relies heavily on federal funding to operate its services. However, as a result of the COVID-19 pandemic, ridership numbers have dropped significantly, leading to reduced revenue and increased pressure on funding.\n2. **Infrastructure:** Much of Amtrak's infrastructure, including tracks, bridges, and stations, is aging or in need of repair. The railroad is working to upgrade its network, but this process can be slow and costly.\n3. **Competition:** With the rise of ride-hailing services, buses, and airlines, Amtrak faces increased competition for passengers' attention. To stay competitive, Amtrak has been focusing on improving service quality, expanding routes, and offering more amenities.\n4. 
**Coronavirus pandemic:** As I mentioned earlier, the pandemic has had a significant impact on Amtrak's ridership and revenue. The railroad has implemented various safety measures to reduce the risk of transmission, but this has also affected its operations.\n\nDespite these challenges, Amtrak is taking steps to modernize and improve its services:\n\n1. **New trains:** Amtrak is introducing new trainsets, such as the Acela Express and the Northeast Regional trains, which offer improved amenities, comfort, and technology.\n2. **Electrification:** Amtrak is working on electrifying some of its routes, like the Northeast Corridor (NEC), to reduce emissions and increase efficiency.\n3. **Station upgrades:** Amtrak is investing in station renovations, including modernizing facilities, improving accessibility, and enhancing passenger experiences.\n4. **Coronavirus response:** The railroad has implemented various measures to reduce the spread of COVID-19 on its trains and stations, such as increased cleaning protocols, social distancing measures, and mask mandates.\n\nAmtrak continues to play a vital role in connecting Americans across the country, and while it faces challenges, the railroad is working to adapt and improve its services for the future.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAt this point, I'm going to wait 10 minutes or so before my next response. <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nTake your time! I'll be here when you're ready to continue the conversation.\n\nIn the meantime, feel free to take a break, grab a snack, stretch, or do whatever relaxes you. When you're ready, just let me know and we can pick up where we left off.\n\nRemember, there's no rush, and I'm happy to chat with you whenever you're ready!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOkay, I'm now responding again. Can you generate a response please?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=6 tid="139684785954816" timestamp=1720817529
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=7 tid="139684785954816" timestamp=1720817529
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=1465 slot_id=0 task_id=7 tid="139684785954816" timestamp=1720817529
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=7 tid="139684785954816" timestamp=1720817529
DEBUG [print_timings] prompt eval time     =    3977.53 ms /  1465 tokens (    2.72 ms per token,   368.32 tokens per second) | n_prompt_tokens_processed=1465 n_tokens_second=368.3188469633934 slot_id=0 t_prompt_processing=3977.532 t_token=2.7150389078498294 task_id=7 tid="139684785954816" timestamp=1720817545
DEBUG [print_timings] generation eval time =   11510.08 ms /   200 runs   (   57.55 ms per token,    17.38 tokens per second) | n_decoded=200 n_tokens_second=17.37607987992434 slot_id=0 t_token=57.55038 t_token_generation=11510.076 task_id=7 tid="139684785954816" timestamp=1720817545
DEBUG [print_timings]           total time =   15487.61 ms | slot_id=0 t_prompt_processing=3977.532 t_token_generation=11510.076 t_total=15487.608 task_id=7 tid="139684785954816" timestamp=1720817545
DEBUG [update_slots] slot released | n_cache_tokens=1665 n_ctx=8192 n_past=1664 n_system_tokens=0 slot_id=0 task_id=7 tid="139684785954816" timestamp=1720817545 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=38044 status=200 tid="139684298833920" timestamp=1720817545
[GIN] 2024/07/12 - 20:52:25 | 200 | 18.505977902s |  192.168.75.195 | POST     "/api/chat"
time=2024-07-12T20:52:25.027Z level=DEBUG source=sched.go:491 msg="context for request finished"
time=2024-07-12T20:52:25.027Z level=DEBUG source=sched.go:363 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-07-12T20:52:25.027Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-07-12T20:57:25.028Z level=DEBUG source=sched.go:365 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:57:25.028Z level=DEBUG source=sched.go:384 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:57:25.028Z level=DEBUG source=sched.go:400 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:57:25.028Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="19.9 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:57:25.292Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="1.5 GiB" now.used="6.5 GiB"
releasing cuda driver library
time=2024-07-12T20:57:25.292Z level=DEBUG source=server.go:1026 msg="stopping llama server"
time=2024-07-12T20:57:25.292Z level=DEBUG source=server.go:1032 msg="waiting for llama server to exit"
time=2024-07-12T20:57:25.381Z level=DEBUG source=server.go:1036 msg="llama server stopped"
time=2024-07-12T20:57:25.381Z level=DEBUG source=sched.go:405 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:57:25.542Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="19.9 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:57:25.694Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="1.5 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T20:57:25.694Z level=DEBUG source=sched.go:684 msg="gpu VRAM free memory converged after 0.67 seconds" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:57:25.694Z level=DEBUG source=sched.go:409 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:57:25.694Z level=DEBUG source=sched.go:332 msg="ignoring unload event with no pending requests"
time=2024-07-12T21:09:08.074Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:09:08.224Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T21:09:08.241Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:09:08.242Z level=DEBUG source=sched.go:251 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:09:08.242Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:09:08.242Z level=INFO source=sched.go:738 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda parallel=4 available=7908343808 required="6.2 GiB"
time=2024-07-12T21:09:08.242Z level=DEBUG source=server.go:98 msg="system memory" total="23.5 GiB" free=21881417728
time=2024-07-12T21:09:08.242Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:09:08.243Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[7.4 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T21:09:08.244Z level=INFO source=server.go:375 msg="starting llama server" cmd="/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 36433"
time=2024-07-12T21:09:08.244Z level=DEBUG source=server.go:390 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4058211551/runners/cuda_v11:/tmp/ollama4058211551/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda]"
time=2024-07-12T21:09:08.244Z level=INFO source=sched.go:474 msg="loaded runners" count=1
time=2024-07-12T21:09:08.244Z level=INFO source=server.go:563 msg="waiting for llama runner to start responding"
time=2024-07-12T21:09:08.244Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="a8db2a9" tid="140349579063296" timestamp=1720818548
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140349579063296" timestamp=1720818548 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="36433" tid="140349579063296" timestamp=1720818548
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-07-12T21:09:08.496Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: GRID P40-8Q, compute capability 6.1, VMM: no
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      CUDA0 buffer size =  4155.99 MiB
time=2024-07-12T21:09:49.204Z level=DEBUG source=server.go:615 msg="model load progress 0.06"
time=2024-07-12T21:09:49.455Z level=DEBUG source=server.go:615 msg="model load progress 0.16"
time=2024-07-12T21:09:50.459Z level=DEBUG source=server.go:615 msg="model load progress 0.18"
time=2024-07-12T21:09:50.962Z level=DEBUG source=server.go:615 msg="model load progress 0.19"
time=2024-07-12T21:09:51.716Z level=DEBUG source=server.go:615 msg="model load progress 0.21"
time=2024-07-12T21:09:52.470Z level=DEBUG source=server.go:615 msg="model load progress 0.22"
time=2024-07-12T21:09:52.721Z level=DEBUG source=server.go:615 msg="model load progress 0.24"
time=2024-07-12T21:09:53.726Z level=DEBUG source=server.go:615 msg="model load progress 0.25"
time=2024-07-12T21:09:53.977Z level=DEBUG source=server.go:615 msg="model load progress 0.26"
time=2024-07-12T21:09:54.480Z level=DEBUG source=server.go:615 msg="model load progress 0.27"
time=2024-07-12T21:09:54.983Z level=DEBUG source=server.go:615 msg="model load progress 0.29"
time=2024-07-12T21:09:55.987Z level=DEBUG source=server.go:615 msg="model load progress 0.31"
time=2024-07-12T21:09:56.490Z level=DEBUG source=server.go:615 msg="model load progress 0.32"
time=2024-07-12T21:09:56.992Z level=DEBUG source=server.go:615 msg="model load progress 0.33"
time=2024-07-12T21:09:57.243Z level=DEBUG source=server.go:615 msg="model load progress 0.34"
time=2024-07-12T21:09:57.746Z level=DEBUG source=server.go:615 msg="model load progress 0.35"
time=2024-07-12T21:09:58.249Z level=DEBUG source=server.go:615 msg="model load progress 0.37"
time=2024-07-12T21:09:59.254Z level=DEBUG source=server.go:615 msg="model load progress 0.39"
time=2024-07-12T21:09:59.756Z level=DEBUG source=server.go:615 msg="model load progress 0.40"
time=2024-07-12T21:10:00.259Z level=DEBUG source=server.go:615 msg="model load progress 0.41"
time=2024-07-12T21:10:00.510Z level=DEBUG source=server.go:615 msg="model load progress 0.42"
time=2024-07-12T21:10:01.515Z level=DEBUG source=server.go:615 msg="model load progress 0.44"
time=2024-07-12T21:10:01.766Z level=DEBUG source=server.go:615 msg="model load progress 0.45"
time=2024-07-12T21:10:02.520Z level=DEBUG source=server.go:615 msg="model load progress 0.47"
time=2024-07-12T21:10:03.022Z level=DEBUG source=server.go:615 msg="model load progress 0.48"
time=2024-07-12T21:10:03.776Z level=DEBUG source=server.go:615 msg="model load progress 0.50"
time=2024-07-12T21:10:04.781Z level=DEBUG source=server.go:615 msg="model load progress 0.52"
time=2024-07-12T21:10:05.032Z level=DEBUG source=server.go:615 msg="model load progress 0.53"
time=2024-07-12T21:10:05.786Z level=DEBUG source=server.go:615 msg="model load progress 0.54"
time=2024-07-12T21:10:06.037Z level=DEBUG source=server.go:615 msg="model load progress 0.55"
time=2024-07-12T21:10:06.791Z level=DEBUG source=server.go:615 msg="model load progress 0.56"
time=2024-07-12T21:10:07.042Z level=DEBUG source=server.go:615 msg="model load progress 0.58"
time=2024-07-12T21:10:08.047Z level=DEBUG source=server.go:615 msg="model load progress 0.60"
time=2024-07-12T21:10:08.550Z level=DEBUG source=server.go:615 msg="model load progress 0.61"
time=2024-07-12T21:10:09.052Z level=DEBUG source=server.go:615 msg="model load progress 0.62"
time=2024-07-12T21:10:09.304Z level=DEBUG source=server.go:615 msg="model load progress 0.63"
time=2024-07-12T21:10:10.058Z level=DEBUG source=server.go:615 msg="model load progress 0.64"
time=2024-07-12T21:10:10.309Z level=DEBUG source=server.go:615 msg="model load progress 0.65"
time=2024-07-12T21:10:10.560Z level=DEBUG source=server.go:615 msg="model load progress 0.66"
time=2024-07-12T21:10:11.314Z level=DEBUG source=server.go:615 msg="model load progress 0.67"
time=2024-07-12T21:10:11.565Z level=DEBUG source=server.go:615 msg="model load progress 0.68"
time=2024-07-12T21:10:11.817Z level=DEBUG source=server.go:615 msg="model load progress 0.69"
time=2024-07-12T21:10:12.319Z level=DEBUG source=server.go:615 msg="model load progress 0.70"
time=2024-07-12T21:10:12.570Z level=DEBUG source=server.go:615 msg="model load progress 0.71"
time=2024-07-12T21:10:13.575Z level=DEBUG source=server.go:615 msg="model load progress 0.73"
time=2024-07-12T21:10:13.826Z level=DEBUG source=server.go:615 msg="model load progress 0.74"
time=2024-07-12T21:10:14.580Z level=DEBUG source=server.go:615 msg="model load progress 0.75"
time=2024-07-12T21:10:14.831Z level=DEBUG source=server.go:615 msg="model load progress 0.76"
time=2024-07-12T21:10:15.334Z level=DEBUG source=server.go:615 msg="model load progress 0.77"
time=2024-07-12T21:10:15.837Z level=DEBUG source=server.go:615 msg="model load progress 0.78"
time=2024-07-12T21:10:16.088Z level=DEBUG source=server.go:615 msg="model load progress 0.79"
time=2024-07-12T21:10:16.842Z level=DEBUG source=server.go:615 msg="model load progress 0.81"
time=2024-07-12T21:10:17.093Z level=DEBUG source=server.go:615 msg="model load progress 0.82"
time=2024-07-12T21:10:17.847Z level=DEBUG source=server.go:615 msg="model load progress 0.83"
time=2024-07-12T21:10:18.098Z level=DEBUG source=server.go:615 msg="model load progress 0.84"
time=2024-07-12T21:10:18.852Z level=DEBUG source=server.go:615 msg="model load progress 0.85"
time=2024-07-12T21:10:19.103Z level=DEBUG source=server.go:615 msg="model load progress 0.86"
time=2024-07-12T21:10:19.354Z level=DEBUG source=server.go:615 msg="model load progress 0.87"
time=2024-07-12T21:10:20.108Z level=DEBUG source=server.go:615 msg="model load progress 0.88"
time=2024-07-12T21:10:20.360Z level=DEBUG source=server.go:615 msg="model load progress 0.89"
time=2024-07-12T21:10:20.611Z level=DEBUG source=server.go:615 msg="model load progress 0.90"
time=2024-07-12T21:10:21.365Z level=DEBUG source=server.go:615 msg="model load progress 0.91"
time=2024-07-12T21:10:21.616Z level=DEBUG source=server.go:615 msg="model load progress 0.92"
time=2024-07-12T21:10:22.118Z level=DEBUG source=server.go:615 msg="model load progress 0.93"
time=2024-07-12T21:10:22.369Z level=DEBUG source=server.go:615 msg="model load progress 0.94"
time=2024-07-12T21:10:22.620Z level=DEBUG source=server.go:615 msg="model load progress 0.95"
time=2024-07-12T21:10:23.374Z level=DEBUG source=server.go:615 msg="model load progress 0.96"
time=2024-07-12T21:10:23.625Z level=DEBUG source=server.go:615 msg="model load progress 0.97"
time=2024-07-12T21:10:23.876Z level=DEBUG source=server.go:615 msg="model load progress 0.98"
time=2024-07-12T21:10:24.630Z level=DEBUG source=server.go:615 msg="model load progress 0.99"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-07-12T21:10:24.880Z level=DEBUG source=server.go:615 msg="model load progress 1.00"
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
time=2024-07-12T21:10:25.132Z level=DEBUG source=server.go:618 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
time=2024-07-12T21:15:25.152Z level=ERROR source=sched.go:480 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 1.00 - "
time=2024-07-12T21:15:25.153Z level=DEBUG source=sched.go:483 msg="triggering expiration for failed load" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.153Z level=DEBUG source=sched.go:384 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.153Z level=DEBUG source=sched.go:400 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
[GIN] 2024/07/12 - 21:15:25 | 500 |         6m17s |  192.168.75.195 | POST     "/api/chat"
time=2024-07-12T21:15:25.153Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="19.9 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:15:25.379Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="1.6 GiB" now.used="6.4 GiB"
releasing cuda driver library
time=2024-07-12T21:15:25.379Z level=DEBUG source=server.go:1026 msg="stopping llama server"
time=2024-07-12T21:15:25.379Z level=DEBUG source=server.go:1032 msg="waiting for llama server to exit"
time=2024-07-12T21:15:25.466Z level=DEBUG source=server.go:1036 msg="llama server stopped"
time=2024-07-12T21:15:25.466Z level=DEBUG source=sched.go:405 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.630Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="19.9 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:15:25.767Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="1.6 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T21:15:25.767Z level=DEBUG source=sched.go:684 msg="gpu VRAM free memory converged after 0.61 seconds" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.767Z level=DEBUG source=sched.go:409 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.767Z level=DEBUG source=sched.go:332 msg="ignoring unload event with no pending requests"
[GIN] 2024/07/12 - 21:26:03 | 200 |      23.765µs |  192.168.75.195 | GET      "/api/version"
time=2024-07-12T21:26:08.828Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:26:08.990Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T21:26:09.008Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:26:09.008Z level=DEBUG source=sched.go:251 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:26:09.008Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:26:09.009Z level=INFO source=sched.go:738 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda parallel=4 available=7908343808 required="6.2 GiB"
time=2024-07-12T21:26:09.009Z level=DEBUG source=server.go:98 msg="system memory" total="23.5 GiB" free=21860347904
time=2024-07-12T21:26:09.009Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:26:09.009Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[7.4 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=INFO source=server.go:375 msg="starting llama server" cmd="/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 42755"
time=2024-07-12T21:26:09.010Z level=DEBUG source=server.go:390 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4058211551/runners/cuda_v11:/tmp/ollama4058211551/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda]"
time=2024-07-12T21:26:09.011Z level=INFO source=sched.go:474 msg="loaded runners" count=1
time=2024-07-12T21:26:09.011Z level=INFO source=server.go:563 msg="waiting for llama runner to start responding"
time=2024-07-12T21:26:09.011Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="a8db2a9" tid="140184082333696" timestamp=1720819569
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140184082333696" timestamp=1720819569 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="42755" tid="140184082333696" timestamp=1720819569
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-07-12T21:26:09.262Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: GRID P40-8Q, compute capability 6.1, VMM: no
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      CUDA0 buffer size =  4155.99 MiB
time=2024-07-12T21:26:49.980Z level=DEBUG source=server.go:615 msg="model load progress 0.06"
time=2024-07-12T21:26:50.232Z level=DEBUG source=server.go:615 msg="model load progress 0.16"
time=2024-07-12T21:26:51.237Z level=DEBUG source=server.go:615 msg="model load progress 0.18"
time=2024-07-12T21:26:51.740Z level=DEBUG source=server.go:615 msg="model load progress 0.19"
time=2024-07-12T21:26:52.242Z level=DEBUG source=server.go:615 msg="model load progress 0.20"
time=2024-07-12T21:26:52.494Z level=DEBUG source=server.go:615 msg="model load progress 0.21"
time=2024-07-12T21:26:53.248Z level=DEBUG source=server.go:615 msg="model load progress 0.22"
time=2024-07-12T21:26:53.499Z level=DEBUG source=server.go:615 msg="model load progress 0.24"
time=2024-07-12T21:26:54.504Z level=DEBUG source=server.go:615 msg="model load progress 0.25"
time=2024-07-12T21:26:54.755Z level=DEBUG source=server.go:615 msg="model load progress 0.26"
time=2024-07-12T21:26:55.258Z level=DEBUG source=server.go:615 msg="model load progress 0.27"
time=2024-07-12T21:26:55.760Z level=DEBUG source=server.go:615 msg="model load progress 0.29"
time=2024-07-12T21:26:56.766Z level=DEBUG source=server.go:615 msg="model load progress 0.31"
time=2024-07-12T21:26:57.017Z level=DEBUG source=server.go:615 msg="model load progress 0.32"
time=2024-07-12T21:26:57.771Z level=DEBUG source=server.go:615 msg="model load progress 0.33"
time=2024-07-12T21:26:58.022Z level=DEBUG source=server.go:615 msg="model load progress 0.34"
time=2024-07-12T21:26:58.525Z level=DEBUG source=server.go:615 msg="model load progress 0.35"
time=2024-07-12T21:26:59.028Z level=DEBUG source=server.go:615 msg="model load progress 0.37"
time=2024-07-12T21:27:00.033Z level=DEBUG source=server.go:615 msg="model load progress 0.39"
time=2024-07-12T21:27:00.536Z level=DEBUG source=server.go:615 msg="model load progress 0.40"
time=2024-07-12T21:27:01.038Z level=DEBUG source=server.go:615 msg="model load progress 0.41"
time=2024-07-12T21:27:01.290Z level=DEBUG source=server.go:615 msg="model load progress 0.42"
time=2024-07-12T21:27:02.044Z level=DEBUG source=server.go:615 msg="model load progress 0.43"
time=2024-07-12T21:27:02.295Z level=DEBUG source=server.go:615 msg="model load progress 0.45"
time=2024-07-12T21:27:03.301Z level=DEBUG source=server.go:615 msg="model load progress 0.47"
time=2024-07-12T21:27:03.803Z level=DEBUG source=server.go:615 msg="model load progress 0.48"
time=2024-07-12T21:27:04.557Z level=DEBUG source=server.go:615 msg="model load progress 0.50"
time=2024-07-12T21:27:05.563Z level=DEBUG source=server.go:615 msg="model load progress 0.52"
time=2024-07-12T21:27:05.814Z level=DEBUG source=server.go:615 msg="model load progress 0.53"
time=2024-07-12T21:27:06.568Z level=DEBUG source=server.go:615 msg="model load progress 0.54"
time=2024-07-12T21:27:06.819Z level=DEBUG source=server.go:615 msg="model load progress 0.55"
time=2024-07-12T21:27:07.322Z level=DEBUG source=server.go:615 msg="model load progress 0.56"
time=2024-07-12T21:27:07.825Z level=DEBUG source=server.go:615 msg="model load progress 0.58"
time=2024-07-12T21:27:08.831Z level=DEBUG source=server.go:615 msg="model load progress 0.60"
time=2024-07-12T21:27:09.333Z level=DEBUG source=server.go:615 msg="model load progress 0.61"
time=2024-07-12T21:27:09.836Z level=DEBUG source=server.go:615 msg="model load progress 0.62"
time=2024-07-12T21:27:10.087Z level=DEBUG source=server.go:615 msg="model load progress 0.63"
time=2024-07-12T21:27:10.841Z level=DEBUG source=server.go:615 msg="model load progress 0.64"
time=2024-07-12T21:27:11.092Z level=DEBUG source=server.go:615 msg="model load progress 0.65"
time=2024-07-12T21:27:11.343Z level=DEBUG source=server.go:615 msg="model load progress 0.66"
time=2024-07-12T21:27:12.098Z level=DEBUG source=server.go:615 msg="model load progress 0.67"
time=2024-07-12T21:27:12.349Z level=DEBUG source=server.go:615 msg="model load progress 0.68"
time=2024-07-12T21:27:12.600Z level=DEBUG source=server.go:615 msg="model load progress 0.69"
time=2024-07-12T21:27:13.103Z level=DEBUG source=server.go:615 msg="model load progress 0.70"
time=2024-07-12T21:27:13.354Z level=DEBUG source=server.go:615 msg="model load progress 0.71"
time=2024-07-12T21:27:14.359Z level=DEBUG source=server.go:615 msg="model load progress 0.73"
time=2024-07-12T21:27:14.611Z level=DEBUG source=server.go:615 msg="model load progress 0.74"
time=2024-07-12T21:27:15.365Z level=DEBUG source=server.go:615 msg="model load progress 0.75"
time=2024-07-12T21:27:15.616Z level=DEBUG source=server.go:615 msg="model load progress 0.76"
time=2024-07-12T21:27:16.119Z level=DEBUG source=server.go:615 msg="model load progress 0.77"
time=2024-07-12T21:27:16.621Z level=DEBUG source=server.go:615 msg="model load progress 0.78"
time=2024-07-12T21:27:16.872Z level=DEBUG source=server.go:615 msg="model load progress 0.79"
time=2024-07-12T21:27:17.627Z level=DEBUG source=server.go:615 msg="model load progress 0.81"
time=2024-07-12T21:27:17.878Z level=DEBUG source=server.go:615 msg="model load progress 0.82"
time=2024-07-12T21:27:18.632Z level=DEBUG source=server.go:615 msg="model load progress 0.83"
time=2024-07-12T21:27:18.883Z level=DEBUG source=server.go:615 msg="model load progress 0.84"
time=2024-07-12T21:27:19.637Z level=DEBUG source=server.go:615 msg="model load progress 0.85"
time=2024-07-12T21:27:19.888Z level=DEBUG source=server.go:615 msg="model load progress 0.86"
time=2024-07-12T21:27:20.139Z level=DEBUG source=server.go:615 msg="model load progress 0.87"
time=2024-07-12T21:27:20.893Z level=DEBUG source=server.go:615 msg="model load progress 0.88"
time=2024-07-12T21:27:21.144Z level=DEBUG source=server.go:615 msg="model load progress 0.89"
time=2024-07-12T21:27:21.396Z level=DEBUG source=server.go:615 msg="model load progress 0.90"
time=2024-07-12T21:27:22.150Z level=DEBUG source=server.go:615 msg="model load progress 0.91"
time=2024-07-12T21:27:22.401Z level=DEBUG source=server.go:615 msg="model load progress 0.92"
time=2024-07-12T21:27:22.904Z level=DEBUG source=server.go:615 msg="model load progress 0.93"
time=2024-07-12T21:27:23.155Z level=DEBUG source=server.go:615 msg="model load progress 0.94"
time=2024-07-12T21:27:23.407Z level=DEBUG source=server.go:615 msg="model load progress 0.95"
time=2024-07-12T21:27:24.160Z level=DEBUG source=server.go:615 msg="model load progress 0.96"
time=2024-07-12T21:27:24.411Z level=DEBUG source=server.go:615 msg="model load progress 0.97"
time=2024-07-12T21:27:24.663Z level=DEBUG source=server.go:615 msg="model load progress 0.98"
time=2024-07-12T21:27:25.417Z level=DEBUG source=server.go:615 msg="model load progress 0.99"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-07-12T21:27:25.668Z level=DEBUG source=server.go:615 msg="model load progress 1.00"
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
time=2024-07-12T21:27:25.919Z level=DEBUG source=server.go:618 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
time=2024-07-12T21:32:26.020Z level=ERROR source=sched.go:480 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 1.00 - "
time=2024-07-12T21:32:26.020Z level=DEBUG source=sched.go:483 msg="triggering expiration for failed load" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:32:26.020Z level=DEBUG source=sched.go:384 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:32:26.020Z level=DEBUG source=sched.go:400 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
[GIN] 2024/07/12 - 21:32:26 | 500 |         6m17s |  192.168.75.195 | POST     "/api/chat"
time=2024-07-12T21:32:26.020Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="19.9 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:32:26.223Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="1.6 GiB" now.used="6.4 GiB"
releasing cuda driver library
time=2024-07-12T21:32:26.223Z level=DEBUG source=server.go:1026 msg="stopping llama server"
time=2024-07-12T21:32:26.224Z level=DEBUG source=server.go:1032 msg="waiting for llama server to exit"
time=2024-07-12T21:32:26.311Z level=DEBUG source=server.go:1036 msg="llama server stopped"
time=2024-07-12T21:32:26.311Z level=DEBUG source=sched.go:405 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:32:26.474Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="19.9 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:32:26.619Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="1.6 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T21:32:26.620Z level=DEBUG source=sched.go:684 msg="gpu VRAM free memory converged after 0.60 seconds" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:32:26.620Z level=DEBUG source=sched.go:409 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:32:26.620Z level=DEBUG source=sched.go:332 msg="ignoring unload event with no pending requests"
<!-- gh-comment-id:2226394638 --> @NWBx01 commented on GitHub (Jul 12, 2024): Log continues:
```
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: GRID P40-8Q, compute capability 6.1, VMM: no
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      CUDA0 buffer size =  4155.99 MiB
time=2024-07-12T20:52:08.015Z level=DEBUG source=server.go:615 msg="model load progress 0.19"
time=2024-07-12T20:52:08.266Z level=DEBUG source=server.go:615 msg="model load progress 0.71"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-07-12T20:52:08.517Z level=DEBUG source=server.go:615 msg="model load progress 1.00"
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
time=2024-07-12T20:52:08.767Z level=DEBUG source=server.go:618 msg="model load completed, waiting for server to become available" status="llm server loading model"
DEBUG [initialize] initializing slots | n_slots=4 tid="139684785954816" timestamp=1720817529
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="139684785954816" timestamp=1720817529
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="139684785954816" timestamp=1720817529
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="139684785954816" timestamp=1720817529
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="139684785954816" timestamp=1720817529
INFO [main] model loaded | tid="139684785954816" timestamp=1720817529
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="139684785954816" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="139684785954816" timestamp=1720817529
time=2024-07-12T20:52:09.270Z level=INFO source=server.go:609 msg="llama runner started in 2.51 seconds"
time=2024-07-12T20:52:09.270Z level=DEBUG source=sched.go:487 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38022 status=200 tid="139684315619328" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38022 status=200 tid="139684315619328" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=3 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38032 status=200 tid="139684307226624" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=4 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38032 status=200 tid="139684307226624" timestamp=1720817529
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=5 tid="139684785954816" timestamp=1720817529
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=38032 status=200 tid="139684307226624" timestamp=1720817529
time=2024-07-12T20:52:09.537Z level=DEBUG source=prompt.go:168 msg="prompt now fits in context window" required=1466 window=2048
time=2024-07-12T20:52:09.537Z level=DEBUG source=routes.go:1334 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhat's the deal with orange juice?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOrange juice - a staple in many breakfast routines, but also a beverage that sparks controversy and debate. Here are some interesting facts and perspectives on OJ:\n\n**Quality concerns:**\n\n1. **Processing:** Most commercial orange juices are made from concentrate, which involves freezing or drying the juice before reconstituting it with water. This can affect the flavor, nutrients, and overall quality of the final product.\n2. **Additives:** Some OJs contain added sugars, preservatives, and flavor enhancers, which may not align with consumer expectations.\n\n**Nutritional aspects:**\n\n1. **Vitamin C:** Orange juice is an excellent source of vitamin C, a vital antioxidant that boosts immune function and overall health.\n2. **Flavonoids:** OJ contains flavonoids, a type of polyphenol that may help protect against chronic diseases like heart disease, cancer, and cognitive decline.\n3. **Sugar content:** Orange juice can be high in natural sugars (fructose and glucose), which can be a concern for those monitoring their sugar intake.\n\n**Environmental impact:**\n\n1. **Sustainability:** Large-scale orange farming can have negative environmental impacts, such as water pollution, soil degradation, and habitat destruction.\n2. **Fair trade:** The orange juice industry is often criticized for exploiting small farmers and workers in countries like Brazil and Florida.\n\n**Cultural significance:**\n\n1. **Breakfast staple:** Orange juice has become a standard breakfast beverage in many Western cultures, particularly in the United States.\n2. **Florida's identity:** Orange juice is closely tied to Florida's agricultural heritage and economy, with the state being one of the world's largest producers.\n\n**Controversies and debates:**\n\n1. **Fake OJ:** The term \"fake orange juice\" refers to OJs that are not 100% freshly squeezed or contain added ingredients.\n2. **Squeeze vs.
concentrate:** There is ongoing debate about whether fresh-squeezed OJ or concentrated juice with water reconstitution is better.\n\nIn conclusion, orange juice is a complex beverage with both positive and negative aspects. While it provides essential nutrients like vitamin C, its processing methods, sugar content, and environmental impact are areas of concern. As consumers, we can make informed choices by opting for high-quality, sustainably sourced OJs or exploring alternative beverages that align with our values and dietary needs.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOh, yeah. Speaking of Florida. I've heard that there was a train that carried oranges or orange juice. Do you know about that?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou're referring to the \"Orange Blossom Special\"!\n\nThe Orange Blossom Special was a nickname for the Atlantic Coast Line Railroad's (ACL) passenger train service between Jacksonville, Florida, and New York City. The train ran from 1929 to 1970 and became famous for its unique cargo: orange juice.\n\nIn the early 20th century, Florida's citrus industry was booming, and oranges were a major commodity. To transport these perishable goods efficiently and safely, ACL developed a specialized train service. The Orange Blossom Special would carry refrigerated cars filled with freshly squeezed orange juice from Florida to major cities in the Northeast.\n\nThe train's route would take it through the Appalachian Mountains, where it would stop at key stations like Washington D.C. and Philadelphia. At each stop, the train would offload its precious cargo to supply local markets. The journey took about 30 hours, depending on the number of stops and the weather conditions.\n\nThe Orange Blossom Special was more than just a transportation service; it became an iconic symbol of Florida's citrus industry and American culture. 
The train was immortalized in song by Johnny Cash, who wrote \"Orange Blossom Special\" (also known as \"The Orange Blossom Special\") in 1965. The catchy tune tells the story of a man waiting for the train at a station, reminiscing about his love of the Florida sunshine and the sweet taste of freshly squeezed OJ.\n\nAlthough the Orange Blossom Special ceased operations in 1970 due to declining passenger traffic and the rise of air transportation, its legacy lives on as a nostalgic reminder of Florida's citrus heritage.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHuh. That's pretty interesting. Speaking of, do you know what's going on with Amtrak?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAmtrak! The national passenger railroad service in the United States has been around since 1971, and it's had its share of challenges and changes over the years.\n\nCurrently, Amtrak is facing some significant hurdles:\n\n1. **Funding:** Amtrak relies heavily on federal funding to operate its services. However, as a result of the COVID-19 pandemic, ridership numbers have dropped significantly, leading to reduced revenue and increased pressure on funding.\n2. **Infrastructure:** Much of Amtrak's infrastructure, including tracks, bridges, and stations, is aging or in need of repair. The railroad is working to upgrade its network, but this process can be slow and costly.\n3. **Competition:** With the rise of ride-hailing services, buses, and airlines, Amtrak faces increased competition for passengers' attention. To stay competitive, Amtrak has been focusing on improving service quality, expanding routes, and offering more amenities.\n4. **Coronavirus pandemic:** As I mentioned earlier, the pandemic has had a significant impact on Amtrak's ridership and revenue. 
The railroad has implemented various safety measures to reduce the risk of transmission, but this has also affected its operations.\n\nDespite these challenges, Amtrak is taking steps to modernize and improve its services:\n\n1. **New trains:** Amtrak is introducing new trainsets, such as the Acela Express and the Northeast Regional trains, which offer improved amenities, comfort, and technology.\n2. **Electrification:** Amtrak is working on electrifying some of its routes, like the Northeast Corridor (NEC), to reduce emissions and increase efficiency.\n3. **Station upgrades:** Amtrak is investing in station renovations, including modernizing facilities, improving accessibility, and enhancing passenger experiences.\n4. **Coronavirus response:** The railroad has implemented various measures to reduce the spread of COVID-19 on its trains and stations, such as increased cleaning protocols, social distancing measures, and mask mandates.\n\nAmtrak continues to play a vital role in connecting Americans across the country, and while it faces challenges, the railroad is working to adapt and improve its services for the future.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAt this point, I'm going to wait 10 minutes or so before my next response. <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nTake your time! I'll be here when you're ready to continue the conversation.\n\nIn the meantime, feel free to take a break, grab a snack, stretch, or do whatever relaxes you. When you're ready, just let me know and we can pick up where we left off.\n\nRemember, there's no rush, and I'm happy to chat with you whenever you're ready!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOkay, I'm now responding again. 
Can you generate a response please?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0 DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=6 tid="139684785954816" timestamp=1720817529 DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=7 tid="139684785954816" timestamp=1720817529 DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=1465 slot_id=0 task_id=7 tid="139684785954816" timestamp=1720817529 DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=7 tid="139684785954816" timestamp=1720817529 DEBUG [print_timings] prompt eval time = 3977.53 ms / 1465 tokens ( 2.72 ms per token, 368.32 tokens per second) | n_prompt_tokens_processed=1465 n_tokens_second=368.3188469633934 slot_id=0 t_prompt_processing=3977.532 t_token=2.7150389078498294 task_id=7 tid="139684785954816" timestamp=1720817545 DEBUG [print_timings] generation eval time = 11510.08 ms / 200 runs ( 57.55 ms per token, 17.38 tokens per second) | n_decoded=200 n_tokens_second=17.37607987992434 slot_id=0 t_token=57.55038 t_token_generation=11510.076 task_id=7 tid="139684785954816" timestamp=1720817545 DEBUG [print_timings] total time = 15487.61 ms | slot_id=0 t_prompt_processing=3977.532 t_token_generation=11510.076 t_total=15487.608 task_id=7 tid="139684785954816" timestamp=1720817545 DEBUG [update_slots] slot released | n_cache_tokens=1665 n_ctx=8192 n_past=1664 n_system_tokens=0 slot_id=0 task_id=7 tid="139684785954816" timestamp=1720817545 truncated=false DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=38044 status=200 tid="139684298833920" timestamp=1720817545 [GIN] 2024/07/12 - 20:52:25 | 200 | 18.505977902s | 192.168.75.195 | POST "/api/chat" time=2024-07-12T20:52:25.027Z level=DEBUG source=sched.go:491 msg="context for request finished" time=2024-07-12T20:52:25.027Z level=DEBUG source=sched.go:363 msg="runner 
with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s time=2024-07-12T20:52:25.027Z level=DEBUG source=sched.go:381 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0 time=2024-07-12T20:57:25.028Z level=DEBUG source=sched.go:365 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-12T20:57:25.028Z level=DEBUG source=sched.go:384 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-12T20:57:25.028Z level=DEBUG source=sched.go:400 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-12T20:57:25.028Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="19.9 GiB" CUDA driver version: 12.4 time=2024-07-12T20:57:25.292Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="1.5 GiB" now.used="6.5 GiB" releasing cuda driver library time=2024-07-12T20:57:25.292Z level=DEBUG source=server.go:1026 msg="stopping llama server" time=2024-07-12T20:57:25.292Z level=DEBUG source=server.go:1032 msg="waiting for llama server to exit" time=2024-07-12T20:57:25.381Z level=DEBUG source=server.go:1036 msg="llama server stopped" time=2024-07-12T20:57:25.381Z level=DEBUG source=sched.go:405 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa 
time=2024-07-12T20:57:25.542Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="19.9 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T20:57:25.694Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="1.5 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T20:57:25.694Z level=DEBUG source=sched.go:684 msg="gpu VRAM free memory converged after 0.67 seconds" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:57:25.694Z level=DEBUG source=sched.go:409 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T20:57:25.694Z level=DEBUG source=sched.go:332 msg="ignoring unload event with no pending requests"
time=2024-07-12T21:09:08.074Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:09:08.224Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T21:09:08.241Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:09:08.242Z level=DEBUG source=sched.go:251 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:09:08.242Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:09:08.242Z level=INFO source=sched.go:738 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda parallel=4 available=7908343808 required="6.2 GiB"
time=2024-07-12T21:09:08.242Z level=DEBUG source=server.go:98 msg="system memory" total="23.5 GiB" free=21881417728
time=2024-07-12T21:09:08.242Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:09:08.243Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[7.4 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server
time=2024-07-12T21:09:08.243Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server
time=2024-07-12T21:09:08.244Z level=INFO source=server.go:375 msg="starting llama server" cmd="/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 36433"
time=2024-07-12T21:09:08.244Z level=DEBUG source=server.go:390 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4058211551/runners/cuda_v11:/tmp/ollama4058211551/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda]"
time=2024-07-12T21:09:08.244Z level=INFO source=sched.go:474 msg="loaded runners" count=1
time=2024-07-12T21:09:08.244Z level=INFO source=server.go:563 msg="waiting for llama runner to start responding"
time=2024-07-12T21:09:08.244Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="a8db2a9" tid="140349579063296" timestamp=1720818548
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140349579063296" timestamp=1720818548 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="36433" tid="140349579063296" timestamp=1720818548
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-07-12T21:09:08.496Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: GRID P40-8Q, compute capability 6.1, VMM: no
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 281.81 MiB
llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
time=2024-07-12T21:09:49.204Z level=DEBUG source=server.go:615 msg="model load progress 0.06"
time=2024-07-12T21:09:49.455Z level=DEBUG source=server.go:615 msg="model load progress 0.16"
time=2024-07-12T21:09:50.459Z level=DEBUG source=server.go:615 msg="model load progress 0.18"
time=2024-07-12T21:09:50.962Z level=DEBUG source=server.go:615 msg="model load progress 0.19"
time=2024-07-12T21:09:51.716Z level=DEBUG source=server.go:615 msg="model load progress 0.21"
time=2024-07-12T21:09:52.470Z level=DEBUG source=server.go:615 msg="model load progress 0.22"
time=2024-07-12T21:09:52.721Z level=DEBUG source=server.go:615 msg="model load progress 0.24"
time=2024-07-12T21:09:53.726Z level=DEBUG source=server.go:615 msg="model load progress 0.25"
time=2024-07-12T21:09:53.977Z level=DEBUG source=server.go:615 msg="model load progress 0.26"
time=2024-07-12T21:09:54.480Z level=DEBUG source=server.go:615 msg="model load progress 0.27"
time=2024-07-12T21:09:54.983Z level=DEBUG source=server.go:615 msg="model load progress 0.29"
time=2024-07-12T21:09:55.987Z level=DEBUG source=server.go:615 msg="model load progress 0.31"
time=2024-07-12T21:09:56.490Z level=DEBUG source=server.go:615 msg="model load progress 0.32"
time=2024-07-12T21:09:56.992Z level=DEBUG source=server.go:615 msg="model load progress 0.33"
time=2024-07-12T21:09:57.243Z level=DEBUG source=server.go:615 msg="model load progress 0.34"
time=2024-07-12T21:09:57.746Z level=DEBUG source=server.go:615 msg="model load progress 0.35"
time=2024-07-12T21:09:58.249Z level=DEBUG source=server.go:615 msg="model load progress 0.37"
time=2024-07-12T21:09:59.254Z level=DEBUG source=server.go:615 msg="model load progress 0.39"
time=2024-07-12T21:09:59.756Z level=DEBUG source=server.go:615 msg="model load progress 0.40"
time=2024-07-12T21:10:00.259Z level=DEBUG source=server.go:615 msg="model load progress 0.41"
time=2024-07-12T21:10:00.510Z level=DEBUG source=server.go:615 msg="model load progress 0.42"
time=2024-07-12T21:10:01.515Z level=DEBUG source=server.go:615 msg="model load progress 0.44"
time=2024-07-12T21:10:01.766Z level=DEBUG source=server.go:615 msg="model load progress 0.45"
time=2024-07-12T21:10:02.520Z level=DEBUG source=server.go:615 msg="model load progress 0.47"
time=2024-07-12T21:10:03.022Z level=DEBUG source=server.go:615 msg="model load progress 0.48"
time=2024-07-12T21:10:03.776Z level=DEBUG source=server.go:615 msg="model load progress 0.50"
time=2024-07-12T21:10:04.781Z level=DEBUG source=server.go:615 msg="model load progress 0.52"
time=2024-07-12T21:10:05.032Z level=DEBUG source=server.go:615 msg="model load progress 0.53"
time=2024-07-12T21:10:05.786Z level=DEBUG source=server.go:615 msg="model load progress 0.54"
time=2024-07-12T21:10:06.037Z level=DEBUG source=server.go:615 msg="model load progress 0.55"
time=2024-07-12T21:10:06.791Z level=DEBUG source=server.go:615 msg="model load progress 0.56"
time=2024-07-12T21:10:07.042Z level=DEBUG source=server.go:615 msg="model load progress 0.58"
time=2024-07-12T21:10:08.047Z level=DEBUG source=server.go:615 msg="model load progress 0.60"
time=2024-07-12T21:10:08.550Z level=DEBUG source=server.go:615 msg="model load progress 0.61"
time=2024-07-12T21:10:09.052Z level=DEBUG source=server.go:615 msg="model load progress 0.62"
time=2024-07-12T21:10:09.304Z level=DEBUG source=server.go:615 msg="model load progress 0.63"
time=2024-07-12T21:10:10.058Z level=DEBUG source=server.go:615 msg="model load progress 0.64"
time=2024-07-12T21:10:10.309Z level=DEBUG source=server.go:615 msg="model load progress 0.65"
time=2024-07-12T21:10:10.560Z level=DEBUG source=server.go:615 msg="model load progress 0.66"
time=2024-07-12T21:10:11.314Z level=DEBUG source=server.go:615 msg="model load progress 0.67"
time=2024-07-12T21:10:11.565Z level=DEBUG source=server.go:615 msg="model load progress 0.68"
time=2024-07-12T21:10:11.817Z level=DEBUG source=server.go:615 msg="model load progress 0.69"
time=2024-07-12T21:10:12.319Z level=DEBUG source=server.go:615 msg="model load progress 0.70"
time=2024-07-12T21:10:12.570Z level=DEBUG source=server.go:615 msg="model load progress 0.71"
time=2024-07-12T21:10:13.575Z level=DEBUG source=server.go:615 msg="model load progress 0.73"
time=2024-07-12T21:10:13.826Z level=DEBUG source=server.go:615 msg="model load progress 0.74"
time=2024-07-12T21:10:14.580Z level=DEBUG source=server.go:615 msg="model load progress 0.75"
time=2024-07-12T21:10:14.831Z level=DEBUG source=server.go:615 msg="model load progress 0.76"
time=2024-07-12T21:10:15.334Z level=DEBUG source=server.go:615 msg="model load progress 0.77"
time=2024-07-12T21:10:15.837Z level=DEBUG source=server.go:615 msg="model load progress 0.78"
time=2024-07-12T21:10:16.088Z level=DEBUG source=server.go:615 msg="model load progress 0.79"
time=2024-07-12T21:10:16.842Z level=DEBUG source=server.go:615 msg="model load progress 0.81"
time=2024-07-12T21:10:17.093Z level=DEBUG source=server.go:615 msg="model load progress 0.82"
time=2024-07-12T21:10:17.847Z level=DEBUG source=server.go:615 msg="model load progress 0.83"
time=2024-07-12T21:10:18.098Z level=DEBUG source=server.go:615 msg="model load progress 0.84"
time=2024-07-12T21:10:18.852Z level=DEBUG source=server.go:615 msg="model load progress 0.85"
time=2024-07-12T21:10:19.103Z level=DEBUG source=server.go:615 msg="model load progress 0.86"
time=2024-07-12T21:10:19.354Z level=DEBUG source=server.go:615 msg="model load progress 0.87"
time=2024-07-12T21:10:20.108Z level=DEBUG source=server.go:615 msg="model load progress 0.88"
time=2024-07-12T21:10:20.360Z level=DEBUG source=server.go:615 msg="model load progress 0.89"
time=2024-07-12T21:10:20.611Z level=DEBUG source=server.go:615 msg="model load progress 0.90"
time=2024-07-12T21:10:21.365Z level=DEBUG source=server.go:615 msg="model load progress 0.91"
time=2024-07-12T21:10:21.616Z level=DEBUG source=server.go:615 msg="model load progress 0.92"
time=2024-07-12T21:10:22.118Z level=DEBUG source=server.go:615 msg="model load progress 0.93"
time=2024-07-12T21:10:22.369Z level=DEBUG source=server.go:615 msg="model load progress 0.94"
time=2024-07-12T21:10:22.620Z level=DEBUG source=server.go:615 msg="model load progress 0.95"
time=2024-07-12T21:10:23.374Z level=DEBUG source=server.go:615 msg="model load progress 0.96"
time=2024-07-12T21:10:23.625Z level=DEBUG source=server.go:615 msg="model load progress 0.97"
time=2024-07-12T21:10:23.876Z level=DEBUG source=server.go:615 msg="model load progress 0.98"
time=2024-07-12T21:10:24.630Z level=DEBUG source=server.go:615 msg="model load progress 0.99"
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-07-12T21:10:24.880Z level=DEBUG source=server.go:615 msg="model load progress 1.00"
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
time=2024-07-12T21:10:25.132Z level=DEBUG source=server.go:618 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
time=2024-07-12T21:15:25.152Z level=ERROR source=sched.go:480 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 1.00 - "
time=2024-07-12T21:15:25.153Z level=DEBUG source=sched.go:483 msg="triggering expiration for failed load" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.153Z level=DEBUG source=sched.go:384 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.153Z level=DEBUG source=sched.go:400 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
[GIN] 2024/07/12 - 21:15:25 | 500 | 6m17s | 192.168.75.195 | POST "/api/chat"
time=2024-07-12T21:15:25.153Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="19.9 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:15:25.379Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="1.6 GiB" now.used="6.4 GiB"
releasing cuda driver library
time=2024-07-12T21:15:25.379Z level=DEBUG source=server.go:1026 msg="stopping llama server"
time=2024-07-12T21:15:25.379Z level=DEBUG source=server.go:1032 msg="waiting for llama server to exit"
time=2024-07-12T21:15:25.466Z level=DEBUG source=server.go:1036 msg="llama server stopped"
time=2024-07-12T21:15:25.466Z level=DEBUG source=sched.go:405 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.630Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="19.9 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:15:25.767Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="1.6 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T21:15:25.767Z level=DEBUG source=sched.go:684 msg="gpu VRAM free memory converged after 0.61 seconds" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.767Z level=DEBUG source=sched.go:409 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:15:25.767Z level=DEBUG source=sched.go:332 msg="ignoring unload event with no pending requests"
[GIN] 2024/07/12 - 21:26:03 | 200 | 23.765µs | 192.168.75.195 | GET "/api/version"
time=2024-07-12T21:26:08.828Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="20.4 GiB"
CUDA driver version: 12.4
time=2024-07-12T21:26:08.990Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB"
releasing cuda driver library
time=2024-07-12T21:26:09.008Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:26:09.008Z level=DEBUG source=sched.go:251 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-12T21:26:09.008Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:26:09.009Z level=INFO source=sched.go:738 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda parallel=4 available=7908343808 required="6.2 GiB"
time=2024-07-12T21:26:09.009Z level=DEBUG source=server.go:98 msg="system memory" total="23.5 GiB" free=21860347904
time=2024-07-12T21:26:09.009Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.4 GiB]"
time=2024-07-12T21:26:09.009Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[7.4 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server
time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found"
file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu/ollama_llama_server time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx/ollama_llama_server time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cpu_avx2/ollama_llama_server time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server time=2024-07-12T21:26:09.010Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4058211551/runners/rocm_v60101/ollama_llama_server time=2024-07-12T21:26:09.010Z level=INFO source=server.go:375 msg="starting llama server" cmd="/tmp/ollama4058211551/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 42755" time=2024-07-12T21:26:09.010Z level=DEBUG source=server.go:390 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama4058211551/runners/cuda_v11:/tmp/ollama4058211551/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 
CUDA_VISIBLE_DEVICES=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda]" time=2024-07-12T21:26:09.011Z level=INFO source=sched.go:474 msg="loaded runners" count=1 time=2024-07-12T21:26:09.011Z level=INFO source=server.go:563 msg="waiting for llama runner to start responding" time=2024-07-12T21:26:09.011Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=1 commit="a8db2a9" tid="140184082333696" timestamp=1720819569 INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140184082333696" timestamp=1720819569 total_threads=8 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="42755" tid="140184082333696" timestamp=1720819569 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-07-12T21:26:09.262Z level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 
llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: GRID P40-8Q, compute capability 6.1, VMM: no llm_load_tensors: ggml ctx size = 0.27 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 281.81 MiB llm_load_tensors: CUDA0 buffer size = 4155.99 MiB time=2024-07-12T21:26:49.980Z level=DEBUG source=server.go:615 msg="model load progress 0.06" time=2024-07-12T21:26:50.232Z level=DEBUG source=server.go:615 msg="model load progress 0.16" time=2024-07-12T21:26:51.237Z level=DEBUG source=server.go:615 msg="model load progress 0.18" time=2024-07-12T21:26:51.740Z level=DEBUG source=server.go:615 msg="model load progress 0.19" time=2024-07-12T21:26:52.242Z level=DEBUG source=server.go:615 msg="model load progress 0.20" time=2024-07-12T21:26:52.494Z level=DEBUG source=server.go:615 msg="model load progress 0.21" time=2024-07-12T21:26:53.248Z level=DEBUG source=server.go:615 msg="model load progress 0.22" time=2024-07-12T21:26:53.499Z level=DEBUG source=server.go:615 msg="model load progress 0.24" time=2024-07-12T21:26:54.504Z level=DEBUG source=server.go:615 msg="model load progress 0.25" time=2024-07-12T21:26:54.755Z level=DEBUG source=server.go:615 msg="model load progress 0.26" time=2024-07-12T21:26:55.258Z level=DEBUG source=server.go:615 msg="model load progress 0.27" time=2024-07-12T21:26:55.760Z level=DEBUG source=server.go:615 msg="model load 
progress 0.29" time=2024-07-12T21:26:56.766Z level=DEBUG source=server.go:615 msg="model load progress 0.31" time=2024-07-12T21:26:57.017Z level=DEBUG source=server.go:615 msg="model load progress 0.32" time=2024-07-12T21:26:57.771Z level=DEBUG source=server.go:615 msg="model load progress 0.33" time=2024-07-12T21:26:58.022Z level=DEBUG source=server.go:615 msg="model load progress 0.34" time=2024-07-12T21:26:58.525Z level=DEBUG source=server.go:615 msg="model load progress 0.35" time=2024-07-12T21:26:59.028Z level=DEBUG source=server.go:615 msg="model load progress 0.37" time=2024-07-12T21:27:00.033Z level=DEBUG source=server.go:615 msg="model load progress 0.39" time=2024-07-12T21:27:00.536Z level=DEBUG source=server.go:615 msg="model load progress 0.40" time=2024-07-12T21:27:01.038Z level=DEBUG source=server.go:615 msg="model load progress 0.41" time=2024-07-12T21:27:01.290Z level=DEBUG source=server.go:615 msg="model load progress 0.42" time=2024-07-12T21:27:02.044Z level=DEBUG source=server.go:615 msg="model load progress 0.43" time=2024-07-12T21:27:02.295Z level=DEBUG source=server.go:615 msg="model load progress 0.45" time=2024-07-12T21:27:03.301Z level=DEBUG source=server.go:615 msg="model load progress 0.47" time=2024-07-12T21:27:03.803Z level=DEBUG source=server.go:615 msg="model load progress 0.48" time=2024-07-12T21:27:04.557Z level=DEBUG source=server.go:615 msg="model load progress 0.50" time=2024-07-12T21:27:05.563Z level=DEBUG source=server.go:615 msg="model load progress 0.52" time=2024-07-12T21:27:05.814Z level=DEBUG source=server.go:615 msg="model load progress 0.53" time=2024-07-12T21:27:06.568Z level=DEBUG source=server.go:615 msg="model load progress 0.54" time=2024-07-12T21:27:06.819Z level=DEBUG source=server.go:615 msg="model load progress 0.55" time=2024-07-12T21:27:07.322Z level=DEBUG source=server.go:615 msg="model load progress 0.56" time=2024-07-12T21:27:07.825Z level=DEBUG source=server.go:615 msg="model load progress 0.58" 
time=2024-07-12T21:27:08.831Z level=DEBUG source=server.go:615 msg="model load progress 0.60" time=2024-07-12T21:27:09.333Z level=DEBUG source=server.go:615 msg="model load progress 0.61" time=2024-07-12T21:27:09.836Z level=DEBUG source=server.go:615 msg="model load progress 0.62" time=2024-07-12T21:27:10.087Z level=DEBUG source=server.go:615 msg="model load progress 0.63" time=2024-07-12T21:27:10.841Z level=DEBUG source=server.go:615 msg="model load progress 0.64" time=2024-07-12T21:27:11.092Z level=DEBUG source=server.go:615 msg="model load progress 0.65" time=2024-07-12T21:27:11.343Z level=DEBUG source=server.go:615 msg="model load progress 0.66" time=2024-07-12T21:27:12.098Z level=DEBUG source=server.go:615 msg="model load progress 0.67" time=2024-07-12T21:27:12.349Z level=DEBUG source=server.go:615 msg="model load progress 0.68" time=2024-07-12T21:27:12.600Z level=DEBUG source=server.go:615 msg="model load progress 0.69" time=2024-07-12T21:27:13.103Z level=DEBUG source=server.go:615 msg="model load progress 0.70" time=2024-07-12T21:27:13.354Z level=DEBUG source=server.go:615 msg="model load progress 0.71" time=2024-07-12T21:27:14.359Z level=DEBUG source=server.go:615 msg="model load progress 0.73" time=2024-07-12T21:27:14.611Z level=DEBUG source=server.go:615 msg="model load progress 0.74" time=2024-07-12T21:27:15.365Z level=DEBUG source=server.go:615 msg="model load progress 0.75" time=2024-07-12T21:27:15.616Z level=DEBUG source=server.go:615 msg="model load progress 0.76" time=2024-07-12T21:27:16.119Z level=DEBUG source=server.go:615 msg="model load progress 0.77" time=2024-07-12T21:27:16.621Z level=DEBUG source=server.go:615 msg="model load progress 0.78" time=2024-07-12T21:27:16.872Z level=DEBUG source=server.go:615 msg="model load progress 0.79" time=2024-07-12T21:27:17.627Z level=DEBUG source=server.go:615 msg="model load progress 0.81" time=2024-07-12T21:27:17.878Z level=DEBUG source=server.go:615 msg="model load progress 0.82" 
time=2024-07-12T21:27:18.632Z level=DEBUG source=server.go:615 msg="model load progress 0.83" time=2024-07-12T21:27:18.883Z level=DEBUG source=server.go:615 msg="model load progress 0.84" time=2024-07-12T21:27:19.637Z level=DEBUG source=server.go:615 msg="model load progress 0.85" time=2024-07-12T21:27:19.888Z level=DEBUG source=server.go:615 msg="model load progress 0.86" time=2024-07-12T21:27:20.139Z level=DEBUG source=server.go:615 msg="model load progress 0.87" time=2024-07-12T21:27:20.893Z level=DEBUG source=server.go:615 msg="model load progress 0.88" time=2024-07-12T21:27:21.144Z level=DEBUG source=server.go:615 msg="model load progress 0.89" time=2024-07-12T21:27:21.396Z level=DEBUG source=server.go:615 msg="model load progress 0.90" time=2024-07-12T21:27:22.150Z level=DEBUG source=server.go:615 msg="model load progress 0.91" time=2024-07-12T21:27:22.401Z level=DEBUG source=server.go:615 msg="model load progress 0.92" time=2024-07-12T21:27:22.904Z level=DEBUG source=server.go:615 msg="model load progress 0.93" time=2024-07-12T21:27:23.155Z level=DEBUG source=server.go:615 msg="model load progress 0.94" time=2024-07-12T21:27:23.407Z level=DEBUG source=server.go:615 msg="model load progress 0.95" time=2024-07-12T21:27:24.160Z level=DEBUG source=server.go:615 msg="model load progress 0.96" time=2024-07-12T21:27:24.411Z level=DEBUG source=server.go:615 msg="model load progress 0.97" time=2024-07-12T21:27:24.663Z level=DEBUG source=server.go:615 msg="model load progress 0.98" time=2024-07-12T21:27:25.417Z level=DEBUG source=server.go:615 msg="model load progress 0.99" llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 time=2024-07-12T21:27:25.668Z level=DEBUG source=server.go:615 msg="model load progress 1.00" llama_kv_cache_init: CUDA0 KV 
buffer size = 1024.00 MiB llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB time=2024-07-12T21:27:25.919Z level=DEBUG source=server.go:618 msg="model load completed, waiting for server to become available" status="llm server loading model" llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 time=2024-07-12T21:32:26.020Z level=ERROR source=sched.go:480 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 1.00 - " time=2024-07-12T21:32:26.020Z level=DEBUG source=sched.go:483 msg="triggering expiration for failed load" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-12T21:32:26.020Z level=DEBUG source=sched.go:384 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-12T21:32:26.020Z level=DEBUG source=sched.go:400 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa [GIN] 2024/07/12 - 21:32:26 | 500 | 6m17s | 192.168.75.195 | POST "/api/chat" time=2024-07-12T21:32:26.020Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="20.4 GiB" now.total="23.5 GiB" now.free="19.9 GiB" CUDA driver version: 12.4 time=2024-07-12T21:32:26.223Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="7.4 GiB" now.total="8.0 GiB" now.free="1.6 GiB" now.used="6.4 GiB" releasing cuda driver library time=2024-07-12T21:32:26.223Z level=DEBUG 
source=server.go:1026 msg="stopping llama server" time=2024-07-12T21:32:26.224Z level=DEBUG source=server.go:1032 msg="waiting for llama server to exit" time=2024-07-12T21:32:26.311Z level=DEBUG source=server.go:1036 msg="llama server stopped" time=2024-07-12T21:32:26.311Z level=DEBUG source=sched.go:405 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-12T21:32:26.474Z level=DEBUG source=gpu.go:336 msg="updating system memory data" before.total="23.5 GiB" before.free="19.9 GiB" now.total="23.5 GiB" now.free="20.4 GiB" CUDA driver version: 12.4 time=2024-07-12T21:32:26.619Z level=DEBUG source=gpu.go:377 msg="updating cuda memory data" gpu=GPU-2c3dceb7-4052-11ef-99c8-7f57aa5c9cda name="GRID P40-8Q" before.total="8.0 GiB" before.free="1.6 GiB" now.total="8.0 GiB" now.free="7.4 GiB" now.used="650.0 MiB" releasing cuda driver library time=2024-07-12T21:32:26.620Z level=DEBUG source=sched.go:684 msg="gpu VRAM free memory converged after 0.60 seconds" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-12T21:32:26.620Z level=DEBUG source=sched.go:409 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-12T21:32:26.620Z level=DEBUG source=sched.go:332 msg="ignoring unload event with no pending requests" ```
@pdevine commented on GitHub (Sep 12, 2024):

@NWBx01 this seems like it slipped through the cracks. Are you still seeing the issue?

<!-- gh-comment-id:2347309580 -->
@NWBx01 commented on GitHub (Nov 1, 2024):

> @NWBx01 this seems like it slipped through the cracks. Are you still seeing the issue?

@pdevine My apologies, I did not see this response. I'm unsure whether the current version of Ollama still has this issue, but I would assume so. Since then, I have been running Ollama in a virtual machine with direct PCIe passthrough of the GPU instead of using Nvidia vGPU. This works correctly and I have not experienced the same issues. It's unclear to me whether this is an incompatibility between Ollama and Nvidia vGPU on the part of vGPU itself (is there maybe some issue with memory mapping?) or whether it's an issue with Ollama.
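Until the root cause is pinned down, one possible workaround (assuming the failure only occurs on the reload that follows the idle unload) is to keep the model resident so the unload path is never taken. Both knobs below are documented Ollama settings; the model name is just an example:

```shell
# Server-side: keep loaded models resident indefinitely.
# Set before starting the server / container.
export OLLAMA_KEEP_ALIVE=-1

# Or client-side, per request: keep_alive=-1 pins this model in memory.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "hello",
  "keep_alive": -1
}'
```

Note this sidesteps the vGPU reload failure rather than fixing it, and the model will hold its VRAM until explicitly unloaded.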

<!-- gh-comment-id:2452555833 -->
@Blakdawn commented on GitHub (Oct 1, 2025):

I have a similar problem, though my setup is slightly different. I have a bare-metal Ubuntu server running MicroK8s, with Ollama running as a Docker container there, with GPUs passed through using the NVIDIA GPU plugin.

Here is the end of the log file:

time=2025-10-01T06:21:11.780Z level=DEBUG source=cache.go:235 msg="context limit hit - shifting" id=0 limit=15000 input=15000 keep=5 discard=7497
update: applying K-shift
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
time=2025-10-01T06:34:23.400Z level=DEBUG source=cache.go:235 msg="context limit hit - shifting" id=0 limit=15000 input=15000 keep=5 discard=7497
update: applying K-shift
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
time=2025-10-01T06:47:35.155Z level=DEBUG source=cache.go:235 msg="context limit hit - shifting" id=0 limit=15000 input=15000 keep=5 discard=7497
update: applying K-shift
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
time=2025-10-01T07:00:02.978Z level=DEBUG source=runner.go:502 msg="hit stop token" pending=[</s>] stop=</s>
[GIN] 2025/10/01 - 07:00:02 | 200 |        55m37s |  192.168.10.103 | POST     "/api/chat"
time=2025-10-01T07:00:02.979Z level=DEBUG source=sched.go:490 msg="context for request finished"
time=2025-10-01T07:00:02.979Z level=DEBUG source=sched.go:286 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000 duration=5m0s
time=2025-10-01T07:00:02.979Z level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000 refCount=0
time=2025-10-01T07:05:02.983Z level=DEBUG source=sched.go:288 msg="timer expired, expiring to unload" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000
time=2025-10-01T07:05:02.983Z level=DEBUG source=sched.go:307 msg="runner expired event received" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000
time=2025-10-01T07:05:02.983Z level=DEBUG source=sched.go:322 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000
time=2025-10-01T07:05:02.983Z level=DEBUG source=sched.go:345 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000
time=2025-10-01T07:05:02.984Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="30.8 GiB" before.free="24.6 GiB" before.free_swap="1.6 GiB" now.total="30.8 GiB" now.free="24.4 GiB" now.free_swap="1.6 GiB"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.172.08
dlsym: cuInit - 0x7d140fd0cab0
dlsym: cuDriverGetVersion - 0x7d140fd0cad0
dlsym: cuDeviceGetCount - 0x7d140fd0cb10
dlsym: cuDeviceGet - 0x7d140fd0caf0
dlsym: cuDeviceGetAttribute - 0x7d140fd0cbf0
dlsym: cuDeviceGetUuid - 0x7d140fd0cb50
dlsym: cuDeviceGetName - 0x7d140fd0cb30
dlsym: cuCtxCreate_v3 - 0x7d140fd0cdd0
dlsym: cuMemGetInfo_v2 - 0x7d140fd2d190
dlsym: cuCtxDestroy - 0x7d140fd6bae0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 2
time=2025-10-01T07:05:03.146Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-c1eea051-dc74-dce9-b83b-462d775d490a name="NVIDIA GeForce RTX 4060" overhead="0 B" before.total="7.6 GiB" before.free="7.5 GiB" now.total="7.6 GiB" now.free="7.4 GiB" now.used="228.6 MiB"
time=2025-10-01T07:05:03.244Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-c83fbba6-bbdb-0915-5c5e-beffcab864d4 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.8 GiB" now.total="10.9 GiB" now.free="570.4 MiB" now.used="10.3 GiB"
releasing cuda driver library
time=2025-10-01T07:05:03.295Z level=DEBUG source=server.go:1683 msg="stopping llama server" pid=53959
time=2025-10-01T07:05:03.296Z level=DEBUG source=server.go:1689 msg="waiting for llama server to exit" pid=53959
time=2025-10-01T07:05:03.414Z level=DEBUG source=server.go:1693 msg="llama server stopped" pid=53959
time=2025-10-01T07:05:03.414Z level=DEBUG source=sched.go:350 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375
time=2025-10-01T07:05:03.495Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="30.8 GiB" before.free="24.4 GiB" before.free_swap="1.6 GiB" now.total="30.8 GiB" now.free="24.9 GiB" now.free_swap="1.6 GiB"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.172.08
dlsym: cuInit - 0x7d140fd0cab0
dlsym: cuDriverGetVersion - 0x7d140fd0cad0
dlsym: cuDeviceGetCount - 0x7d140fd0cb10
dlsym: cuDeviceGet - 0x7d140fd0caf0
dlsym: cuDeviceGetAttribute - 0x7d140fd0cbf0
dlsym: cuDeviceGetUuid - 0x7d140fd0cb50
dlsym: cuDeviceGetName - 0x7d140fd0cb30
dlsym: cuCtxCreate_v3 - 0x7d140fd0cdd0
dlsym: cuMemGetInfo_v2 - 0x7d140fd2d190
dlsym: cuCtxDestroy - 0x7d140fd6bae0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 2
time=2025-10-01T07:05:03.668Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-c1eea051-dc74-dce9-b83b-462d775d490a name="NVIDIA GeForce RTX 4060" overhead="0 B" before.total="7.6 GiB" before.free="7.4 GiB" now.total="7.6 GiB" now.free="7.5 GiB" now.used="131.4 MiB"
time=2025-10-01T07:05:03.791Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-c83fbba6-bbdb-0915-5c5e-beffcab864d4 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="570.4 MiB" now.total="10.9 GiB" now.free="10.8 GiB" now.used="145.0 MiB"
releasing cuda driver library
time=2025-10-01T07:05:03.791Z level=DEBUG source=sched.go:662 msg="gpu VRAM free memory converged after 0.81 seconds" runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375
time=2025-10-01T07:05:03.791Z level=DEBUG source=sched.go:353 msg="sending an unloaded event" runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375
time=2025-10-01T07:05:03.791Z level=DEBUG source=sched.go:255 msg="ignoring unload event with no pending requests"
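As an aside on the `context limit hit - shifting` lines above: the logged values are consistent with the runner discarding half of the tokens past the `keep` prefix when the context fills. A minimal sketch of that arithmetic (my reconstruction from the log values, not Ollama's actual source):

```python
def shift_discard(input_len: int, keep: int) -> int:
    """Reconstructed guess at the K-shift discard count.

    The log shows limit=15000 input=15000 keep=5 discard=7497,
    which matches dropping half of the non-kept tokens with
    integer division: (15000 - 5) // 2 == 7497.
    """
    return (input_len - keep) // 2

print(shift_discard(15000, 5))  # → 7497
```

If that holds, each shift frees roughly half the context window, which is why the shifts recur at long intervals during the 55-minute generation above.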
<!-- gh-comment-id:3355759487 --> @Blakdawn commented on GitHub (Oct 1, 2025):

I have a similar problem, though my setup is slightly different. I have a bare-metal Ubuntu server running MicroK8s, with Ollama running as a Docker container there, with GPUs passed through using the NVIDIA GPU plugin. Here is the end of the log file:

```
time=2025-10-01T06:21:11.780Z level=DEBUG source=cache.go:235 msg="context limit hit - shifting" id=0 limit=15000 input=15000 keep=5 discard=7497
update: applying K-shift
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
time=2025-10-01T06:34:23.400Z level=DEBUG source=cache.go:235 msg="context limit hit - shifting" id=0 limit=15000 input=15000 keep=5 discard=7497
update: applying K-shift
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
time=2025-10-01T06:47:35.155Z level=DEBUG source=cache.go:235 msg="context limit hit - shifting" id=0 limit=15000 input=15000 keep=5 discard=7497
update: applying K-shift
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
time=2025-10-01T07:00:02.978Z level=DEBUG source=runner.go:502 msg="hit stop token" pending=[</s>] stop=</s>
[GIN] 2025/10/01 - 07:00:02 | 200 | 55m37s | 192.168.10.103 | POST "/api/chat"
time=2025-10-01T07:00:02.979Z level=DEBUG source=sched.go:490 msg="context for request finished"
time=2025-10-01T07:00:02.979Z level=DEBUG source=sched.go:286 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000 duration=5m0s
time=2025-10-01T07:00:02.979Z level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000 refCount=0
time=2025-10-01T07:05:02.983Z level=DEBUG source=sched.go:288 msg="timer expired, expiring to unload" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000
time=2025-10-01T07:05:02.983Z level=DEBUG source=sched.go:307 msg="runner expired event received" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000
time=2025-10-01T07:05:02.983Z level=DEBUG source=sched.go:322 msg="got lock to unload expired event" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000
time=2025-10-01T07:05:02.983Z level=DEBUG source=sched.go:345 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/library/MODEL:latest runner.inference=cuda runner.devices=2 runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375 runner.num_ctx=15000
time=2025-10-01T07:05:02.984Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="30.8 GiB" before.free="24.6 GiB" before.free_swap="1.6 GiB" now.total="30.8 GiB" now.free="24.4 GiB" now.free_swap="1.6 GiB"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.172.08
dlsym: cuInit - 0x7d140fd0cab0
dlsym: cuDriverGetVersion - 0x7d140fd0cad0
dlsym: cuDeviceGetCount - 0x7d140fd0cb10
dlsym: cuDeviceGet - 0x7d140fd0caf0
dlsym: cuDeviceGetAttribute - 0x7d140fd0cbf0
dlsym: cuDeviceGetUuid - 0x7d140fd0cb50
dlsym: cuDeviceGetName - 0x7d140fd0cb30
dlsym: cuCtxCreate_v3 - 0x7d140fd0cdd0
dlsym: cuMemGetInfo_v2 - 0x7d140fd2d190
dlsym: cuCtxDestroy - 0x7d140fd6bae0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 2
time=2025-10-01T07:05:03.146Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-c1eea051-dc74-dce9-b83b-462d775d490a name="NVIDIA GeForce RTX 4060" overhead="0 B" before.total="7.6 GiB" before.free="7.5 GiB" now.total="7.6 GiB" now.free="7.4 GiB" now.used="228.6 MiB"
time=2025-10-01T07:05:03.244Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-c83fbba6-bbdb-0915-5c5e-beffcab864d4 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.8 GiB" now.total="10.9 GiB" now.free="570.4 MiB" now.used="10.3 GiB"
releasing cuda driver library
time=2025-10-01T07:05:03.295Z level=DEBUG source=server.go:1683 msg="stopping llama server" pid=53959
time=2025-10-01T07:05:03.296Z level=DEBUG source=server.go:1689 msg="waiting for llama server to exit" pid=53959
time=2025-10-01T07:05:03.414Z level=DEBUG source=server.go:1693 msg="llama server stopped" pid=53959
time=2025-10-01T07:05:03.414Z level=DEBUG source=sched.go:350 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375
time=2025-10-01T07:05:03.495Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="30.8 GiB" before.free="24.4 GiB" before.free_swap="1.6 GiB" now.total="30.8 GiB" now.free="24.9 GiB" now.free_swap="1.6 GiB"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.172.08
dlsym: cuInit - 0x7d140fd0cab0
dlsym: cuDriverGetVersion - 0x7d140fd0cad0
dlsym: cuDeviceGetCount - 0x7d140fd0cb10
dlsym: cuDeviceGet - 0x7d140fd0caf0
dlsym: cuDeviceGetAttribute - 0x7d140fd0cbf0
dlsym: cuDeviceGetUuid - 0x7d140fd0cb50
dlsym: cuDeviceGetName - 0x7d140fd0cb30
dlsym: cuCtxCreate_v3 - 0x7d140fd0cdd0
dlsym: cuMemGetInfo_v2 - 0x7d140fd2d190
dlsym: cuCtxDestroy - 0x7d140fd6bae0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 2
time=2025-10-01T07:05:03.668Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-c1eea051-dc74-dce9-b83b-462d775d490a name="NVIDIA GeForce RTX 4060" overhead="0 B" before.total="7.6 GiB" before.free="7.4 GiB" now.total="7.6 GiB" now.free="7.5 GiB" now.used="131.4 MiB"
time=2025-10-01T07:05:03.791Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-c83fbba6-bbdb-0915-5c5e-beffcab864d4 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="570.4 MiB" now.total="10.9 GiB" now.free="10.8 GiB" now.used="145.0 MiB"
releasing cuda driver library
time=2025-10-01T07:05:03.791Z level=DEBUG source=sched.go:662 msg="gpu VRAM free memory converged after 0.81 seconds" runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375
time=2025-10-01T07:05:03.791Z level=DEBUG source=sched.go:353 msg="sending an unloaded event" runner.size="10.5 GiB" runner.vram="10.5 GiB" runner.parallel=1 runner.pid=53959 runner.model=/root/.ollama/models/blobs/sha256-24a85a342d37802a3f0a0590c8c00140b37b6808ba3d8b1326e963a4243cb375
time=2025-10-01T07:05:03.791Z level=DEBUG source=sched.go:255 msg="ignoring unload event with no pending requests"
```
Reference: github-starred/ollama#65564