[GH-ISSUE #7311] ollama 0.4.0-rc3: deepseek-coder-v2-lite is not functioning correctly. #51157

Closed
opened 2026-04-28 18:40:28 -05:00 by GiteaMirror · 11 comments

Originally created by @emzaedu on GitHub (Oct 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7311

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I encountered an error while attempting to run both the q8_0 and q4_k_m quantizations.

`Error: llama runner process has terminated: error loading model: error loading model vocabulary: wstring_convert::from_bytes`
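For context: this message is the `what()` text of the C++ `std::range_error` that `std::wstring_convert::from_bytes` throws when asked to decode bytes that are not well-formed UTF-8, so the loader most likely hit an invalid byte sequence while reading the vocabulary strings. A minimal Go sketch of the same well-formedness check (the byte values are illustrative, not taken from the model):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// wstring_convert::from_bytes raises std::range_error on input that is
	// not well-formed UTF-8; utf8.ValidString performs the equivalent check.
	// The bad bytes below are illustrative only, not taken from the model.
	for _, s := range []string{"hello", "\xff\xfe"} {
		fmt.Printf("%q valid UTF-8: %v\n", s, utf8.ValidString(s))
	}
}
```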

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.4.0-rc3

GiteaMirror added the bug, windows labels 2026-04-28 18:40:28 -05:00

@rick-github commented on GitHub (Oct 22, 2024):

Where did you get the model from?


@emzaedu commented on GitHub (Oct 22, 2024):

> Where did you get the model from?

https://ollama.com/library/deepseek-coder-v2:16b-lite-instruct-q8_0


@rick-github commented on GitHub (Oct 22, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.

````console
$ ollama -v
ollama version is 0.4.0-rc3
$ ollama run deepseek-coder-v2:16b-lite-instruct-q8_0 'write a python script to print the numbers 1 to 100 without a for loop'
 Certainly! You can achieve this by using recursion. Here's a Python script that prints numbers from 1 to 100 without using a `for` or `while` loop:

```python
def print_numbers(n):
    if n <= 100:
        print(n)
        print_numbers(n + 1)

# Start the recursion with 1
print_numbers(1)
```

This script defines a function `print_numbers` that takes an integer `n` as its parameter. The function prints the current number and then calls itself with the next number (`n + 1`) until it reaches 100.
````
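Since the same `deepseek-coder-v2:16b-lite-instruct-q8_0` tag loads and generates here, a damaged local copy is one plausible culprit. Ollama stores model blobs content-addressed, named after the SHA-256 digest of their contents, so one hedged check is to rehash the file and compare the result with its own file name. A minimal Go sketch, using the blob path that appears in the server log below:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	// A healthy blob rehashes to its own name (sha256-<digest>); a mismatch
	// would point at a truncated or corrupted download. Path taken from the
	// server log below; adjust the user name for your machine.
	path := `C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f`
	f, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("sha256-%x\n", h.Sum(nil)) // compare with the file name
}
```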

@emzaedu commented on GitHub (Oct 22, 2024):

server.log:

```
2024/10/22 16:49:53 routes.go:1170: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\User\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-10-22T16:49:53.112+03:00 level=INFO source=images.go:754 msg="total blobs: 101"
time=2024-10-22T16:49:53.117+03:00 level=INFO source=images.go:761 msg="total unused blobs removed: 0"
time=2024-10-22T16:49:53.121+03:00 level=INFO source=routes.go:1217 msg="Listening on [::]:11434 (version 0.4.0-rc3)"
time=2024-10-22T16:49:53.122+03:00 level=INFO source=common.go:82 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm]"
time=2024-10-22T16:49:53.122+03:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-10-22T16:49:53.122+03:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-10-22T16:49:53.122+03:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=12 efficiency=0 threads=24
time=2024-10-22T16:49:53.277+03:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e library=cuda variant=v12 compute=8.9 driver=12.6 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/10/22 - 16:49:53 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/10/22 - 16:49:53 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2024/10/22 - 16:50:05 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/10/22 - 16:50:05 | 200 |     27.5007ms |       127.0.0.1 | POST     "/api/show"
time=2024-10-22T16:50:05.080+03:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e parallel=4 available=21867708416 required="18.3 GiB"
time=2024-10-22T16:50:05.097+03:00 level=INFO source=llama-server.go:72 msg="system memory" total="191.6 GiB" free="163.6 GiB" free_swap="164.7 GiB"
time=2024-10-22T16:50:05.098+03:00 level=INFO source=memory.go:346 msg="offload to cuda" layers.requested=-1 layers.model=28 layers.offload=28 layers.split="" memory.available="[20.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.3 GiB" memory.required.partial="18.3 GiB" memory.required.kv="2.1 GiB" memory.required.allocations="[18.3 GiB]" memory.weights.total="17.2 GiB" memory.weights.repeating="17.0 GiB" memory.weights.nonrepeating="212.5 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="391.4 MiB"
time=2024-10-22T16:50:05.104+03:00 level=INFO source=llama-server.go:355 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model C:\\Users\\User\\.ollama\\models\\blobs\\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f --ctx-size 8192 --batch-size 512 --embedding --n-gpu-layers 28 --threads 12 --no-mmap --parallel 4 --port 50106"
time=2024-10-22T16:50:07.688+03:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-10-22T16:50:07.688+03:00 level=INFO source=llama-server.go:534 msg="waiting for llama runner to start responding"
time=2024-10-22T16:50:07.689+03:00 level=INFO source=llama-server.go:568 msg="waiting for server to become available" status="llm server error"
time=2024-10-22T16:50:08.897+03:00 level=INFO source=runner.go:856 msg="starting go runner"
time=2024-10-22T16:50:08.911+03:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:50106"
llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = DeepSeek-Coder-V2-Lite-Instruct
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is d...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type q8_0:  269 tensors
time=2024-10-22T16:50:09.000+03:00 level=INFO source=llama-server.go:568 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: wstring_convert::from_bytes
llama_load_model_from_file: failed to load model
llama_new_context_with_model: model cannot be NULL
Exception 0xc0000005 0x0 0x4221 0x79aa50
PC=0x79aa50
signal arrived during external code execution

runtime.cgocall(0x7129c0, 0xc00007dd98)
	runtime/cgocall.go:157 +0x3e fp=0xc00007dd70 sp=0xc00007dd38 pc=0x495dde
github.com/ollama/ollama/llama._Cfunc_llama_add_bos_token(0x0)
	_cgo_gotypes.go:449 +0x52 fp=0xc00007dd98 sp=0xc00007dd70 pc=0x5975f2
main.(*Server).loadModel.(*Model).AddBOSToken.func1(0x80000002000?)
	github.com/ollama/ollama/llama/llama.go:242 +0x3e fp=0xc00007ddd8 sp=0xc00007dd98 pc=0x71097e
github.com/ollama/ollama/llama.(*Model).AddBOSToken(...)
	github.com/ollama/ollama/llama/llama.go:242
main.(*Server).loadModel(0xc0000be120, {0x1c, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc0000221c0, 0x0}, ...)
	github.com/ollama/ollama/llama/runner/runner.go:792 +0x219 fp=0xc00007df38 sp=0xc00007ddd8 pc=0x710619
main.main.gowrap1()
	github.com/ollama/ollama/llama/runner/runner.go:889 +0x95 fp=0xc00007dfe0 sp=0xc00007df38 pc=0x711d55
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc00007dfe8 sp=0xc00007dfe0 pc=0x4fe2e1
created by main.main in goroutine 1
	github.com/ollama/ollama/llama/runner/runner.go:889 +0xbd6

goroutine 1 gp=0xc000064000 m=nil [IO wait]:
runtime.gopark(0xcf0e80?, 0x9bed40?, 0xa0?, 0xcc?, 0xc00008ccd0?)
	runtime/proc.go:402 +0xce fp=0xc0000a3748 sp=0xc0000a3728 pc=0x4cdfee
runtime.netpollblock(0x1d4?, 0x4958e6?, 0x0?)
	runtime/netpoll.go:573 +0xf7 fp=0xc0000a3780 sp=0xc0000a3748 pc=0x4c4d97
internal/poll.runtime_pollWait(0x1fc1d9a8000, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc0000a37a0 sp=0xc0000a3780 pc=0x4f84c5
internal/poll.(*pollDesc).wait(0xcf1280?, 0x0?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0000a37c8 sp=0xc0000a37a0 pc=0x54fc47
internal/poll.execIO(0xc00008cca0, 0xc0000a3868)
	internal/poll/fd_windows.go:175 +0xe6 fp=0xc0000a3838 sp=0xc0000a37c8 pc=0x550586
internal/poll.(*FD).acceptOne(0xc00008cc88, 0x1e4, {0xc00001a1e0?, 0x0?, 0x0?}, 0xcf0e80?)
	internal/poll/fd_windows.go:944 +0x67 fp=0xc0000a3898 sp=0xc0000a3838 pc=0x552567
internal/poll.(*FD).Accept(0xc00008cc88, 0xc0000a3a48)
	internal/poll/fd_windows.go:978 +0x1bc fp=0xc0000a3950 sp=0xc0000a3898 pc=0x55289c
net.(*netFD).accept(0xc00008cc88)
	net/fd_windows.go:178 +0x54 fp=0xc0000a3a68 sp=0xc0000a3950 pc=0x5ba974
net.(*TCPListener).accept(0xc000072740)
	net/tcpsock_posix.go:159 +0x1e fp=0xc0000a3a90 sp=0xc0000a3a68 pc=0x5cbe9e
net.(*TCPListener).Accept(0xc000072740)
	net/tcpsock.go:327 +0x30 fp=0xc0000a3ac0 sp=0xc0000a3a90 pc=0x5cb0d0
net/http.(*onceCloseListener).Accept(0xc0000be1b0?)
	<autogenerated>:1 +0x24 fp=0xc0000a3ad8 sp=0xc0000a3ac0 pc=0x6efae4
net/http.(*Server).Serve(0xc0000f0000, {0xadd5c0, 0xc000072740})
	net/http/server.go:3260 +0x33e fp=0xc0000a3c08 sp=0xc0000a3ad8 pc=0x6e68fe
main.main()
	github.com/ollama/ollama/llama/runner/runner.go:914 +0x104c fp=0xc0000a3f50 sp=0xc0000a3c08 pc=0x711a0c
runtime.main()
	runtime/proc.go:271 +0x28b fp=0xc0000a3fe0 sp=0xc0000a3f50 pc=0x4cdbeb
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0000a3fe8 sp=0xc0000a3fe0 pc=0x4fe2e1

goroutine 2 gp=0xc000064700 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000067fa8 sp=0xc000067f88 pc=0x4cdfee
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.forcegchelper()
	runtime/proc.go:326 +0xb8 fp=0xc000067fe0 sp=0xc000067fa8 pc=0x4cde78
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000067fe8 sp=0xc000067fe0 pc=0x4fe2e1
created by runtime.init.6 in goroutine 1
	runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000064a80 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000069f80 sp=0xc000069f60 pc=0x4cdfee
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.bgsweep(0xc000074000)
	runtime/mgcsweep.go:278 +0x94 fp=0xc000069fc8 sp=0xc000069f80 pc=0x4b7674
runtime.gcenable.gowrap1()
	runtime/mgc.go:203 +0x25 fp=0xc000069fe0 sp=0xc000069fc8 pc=0x4ac185
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000069fe8 sp=0xc000069fe0 pc=0x4fe2e1
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000064c40 m=nil [GC scavenge wait]:
runtime.gopark(0xc000074000?, 0xad8090?, 0x1?, 0x0?, 0xc000064c40?)
	runtime/proc.go:402 +0xce fp=0xc00007bf78 sp=0xc00007bf58 pc=0x4cdfee
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.(*scavengerState).park(0xcf04c0)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc00007bfa8 sp=0xc00007bf78 pc=0x4b5069
runtime.bgscavenge(0xc000074000)
	runtime/mgcscavenge.go:653 +0x3c fp=0xc00007bfc8 sp=0xc00007bfa8 pc=0x4b55fc
runtime.gcenable.gowrap2()
	runtime/mgc.go:204 +0x25 fp=0xc00007bfe0 sp=0xc00007bfc8 pc=0x4ac125
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc00007bfe8 sp=0xc00007bfe0 pc=0x4fe2e1
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0xa5

goroutine 5 gp=0xc000065180 m=nil [finalizer wait]:
runtime.gopark(0xc00006be48?, 0x49fac5?, 0xa8?, 0x1?, 0xc000064000?)
	runtime/proc.go:402 +0xce fp=0xc00006be20 sp=0xc00006be00 pc=0x4cdfee
runtime.runfinq()
	runtime/mfinal.go:194 +0x107 fp=0xc00006bfe0 sp=0xc00006be20 pc=0x4ab207
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc00006bfe8 sp=0xc00006bfe0 pc=0x4fe2e1
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:164 +0x3d

goroutine 7 gp=0xc000065500 m=nil [semacquire]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000077d30 sp=0xc000077d10 pc=0x4cdfee
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.semacquire1(0xc0000be188, 0x0, 0x1, 0x0, 0x12)
	runtime/sema.go:160 +0x232 fp=0xc000077d98 sp=0xc000077d30 pc=0x4df492
sync.runtime_Semacquire(0x0?)
	runtime/sema.go:62 +0x25 fp=0xc000077dd0 sp=0xc000077d98 pc=0x4f9ba5
sync.(*WaitGroup).Wait(0x0?)
	sync/waitgroup.go:116 +0x48 fp=0xc000077df8 sp=0xc000077dd0 pc=0x51bd08
main.(*Server).run(0xc0000be120, {0xaddbf0, 0xc000082050})
	github.com/ollama/ollama/llama/runner/runner.go:324 +0x51 fp=0xc000077fb8 sp=0xc000077df8 pc=0x70d131
main.main.gowrap2()
	github.com/ollama/ollama/llama/runner/runner.go:894 +0x28 fp=0xc000077fe0 sp=0xc000077fb8 pc=0x711c88
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000077fe8 sp=0xc000077fe0 pc=0x4fe2e1
created by main.main in goroutine 1
	github.com/ollama/ollama/llama/runner/runner.go:894 +0xcab

goroutine 8 gp=0xc0000656c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0xc00008cf20?, 0xd0?, 0xcf?, 0xc00008cf50?)
	runtime/proc.go:402 +0xce fp=0xc000027890 sp=0xc000027870 pc=0x4cdfee
runtime.netpollblock(0x1d8?, 0x4958e6?, 0x0?)
	runtime/netpoll.go:573 +0xf7 fp=0xc0000278c8 sp=0xc000027890 pc=0x4c4d97
internal/poll.runtime_pollWait(0x1fc1d9a7f08, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc0000278e8 sp=0xc0000278c8 pc=0x4f84c5
internal/poll.(*pollDesc).wait(0xc000027968?, 0x4a7b7d?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000027910 sp=0xc0000278e8 pc=0x54fc47
internal/poll.execIO(0xc00008cf20, 0xa8be90)
	internal/poll/fd_windows.go:175 +0xe6 fp=0xc000027980 sp=0xc000027910 pc=0x550586
internal/poll.(*FD).Read(0xc00008cf08, {0xc000116000, 0x1000, 0x1000})
	internal/poll/fd_windows.go:436 +0x2b1 fp=0xc000027a28 sp=0xc000027980 pc=0x551231
net.(*netFD).Read(0xc00008cf08, {0xc000116000?, 0xc000027a98?, 0x550125?})
	net/fd_posix.go:55 +0x25 fp=0xc000027a70 sp=0xc000027a28 pc=0x5b9885
net.(*conn).Read(0xc00006e098, {0xc000116000?, 0x0?, 0xc000108038?})
	net/net.go:185 +0x45 fp=0xc000027ab8 sp=0xc000027a70 pc=0x5c5445
net.(*TCPConn).Read(0xc000108030?, {0xc000116000?, 0xc00008cf08?, 0xc000027af0?})
	<autogenerated>:1 +0x25 fp=0xc000027ae8 sp=0xc000027ab8 pc=0x5cefc5
net/http.(*connReader).Read(0xc000108030, {0xc000116000, 0x1000, 0x1000})
	net/http/server.go:789 +0x14b fp=0xc000027b38 sp=0xc000027ae8 pc=0x6dc70b
bufio.(*Reader).fill(0xc000114000)
	bufio/bufio.go:110 +0x103 fp=0xc000027b70 sp=0xc000027b38 pc=0x6997e3
bufio.(*Reader).Peek(0xc000114000, 0x4)
	bufio/bufio.go:148 +0x53 fp=0xc000027b90 sp=0xc000027b70 pc=0x699913
net/http.(*conn).serve(0xc0000be1b0, {0xaddbb8, 0xc00001cde0})
	net/http/server.go:2079 +0x749 fp=0xc000027fb8 sp=0xc000027b90 pc=0x6e2469
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3290 +0x28 fp=0xc000027fe0 sp=0xc000027fb8 pc=0x6e6ce8
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000027fe8 sp=0xc000027fe0 pc=0x4fe2e1
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3290 +0x4b4
rax     0xc00007e000
rbx     0xc00007dd98
rcx     0x40f8
rdx     0xc00007dd28
rdi     0xc00007e000
rsi     0xc000064e00
rbp     0xba613ff8c0
rsp     0xba613ff858
r8      0xc000071808
r9      0x0
r10     0x1fc1d9e05a8
r11     0x0
r12     0xc000182010
r13     0x0
r14     0xc000065340
r15     0x3fffffffffffffff
rip     0x79aa50
rflags  0x10202
cs      0x33
fs      0x53
gs      0x2b
time=2024-10-22T16:50:09.265+03:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: error loading model: error loading model vocabulary: wstring_convert::from_bytes"
[GIN] 2024/10/22 - 16:50:09 | 500 |     4.223671s |       127.0.0.1 | POST     "/api/generate"
```
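Two distinct failures are visible in this log: the vocabulary load error itself, and then a secondary access violation (`Exception 0xc0000005`) because the runner keeps going and calls `llama_add_bos_token` with a NULL model pointer (`_Cfunc_llama_add_bos_token(0x0)` in the trace, right after `llama_new_context_with_model: model cannot be NULL`). A minimal sketch of the guard pattern that would turn that crash into a clean error; the types below are hypothetical stand-ins, not ollama's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// model is a hypothetical stand-in for the cgo-backed llama handle; a nil
// value corresponds to the NULL pointer forwarded to llama_add_bos_token.
type model struct{}

func (m *model) addBOSToken() bool { return true }

// loadModel mimics the failure in the trace: the C loader returns NULL, and
// the caller must not pass that handle back into C.
func loadModel(ok bool) (*model, error) {
	if !ok {
		return nil, errors.New("error loading model vocabulary: wstring_convert::from_bytes")
	}
	return &model{}, nil
}

func main() {
	m, err := loadModel(false)
	if err != nil {
		// Fail cleanly here instead of crashing inside C with 0xc0000005.
		fmt.Println("load failed:", err)
		return
	}
	fmt.Println(m.addBOSToken())
}
```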

@rick-github commented on GitHub (Oct 22, 2024):

Unfortunately not a great deal of detail. Please add `OLLAMA_DEBUG=1` to the [server environment](https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-windows) and try again.


@emzaedu commented on GitHub (Oct 22, 2024):

I think I already had debug turned on; here's another log:

```
2024/10/22 19:17:47 routes.go:1170: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\User\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-10-22T19:17:47.697+03:00 level=INFO source=images.go:754 msg="total blobs: 101"
time=2024-10-22T19:17:47.702+03:00 level=INFO source=images.go:761 msg="total unused blobs removed: 0"
time=2024-10-22T19:17:47.705+03:00 level=INFO source=routes.go:1217 msg="Listening on [::]:11434 (version 0.4.0-rc3)"
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe
time=2024-10-22T19:17:47.706+03:00 level=INFO source=common.go:82 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm]"
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=common.go:83 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=sched.go:106 msg="starting llm scheduler"
time=2024-10-22T19:17:47.706+03:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-10-22T19:17:47.706+03:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-10-22T19:17:47.706+03:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=12 efficiency=0 threads=24
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=gpu.go:94 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=gpu.go:505 msg="Searching for GPU library" name=nvml.dll
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=gpu.go:528 msg="gpu library search" globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin\\nvml.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files\\Git\\bin\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files\\PowerShell\\7\\nvml.dll C:\\Program Files\\nodejs\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvml.dll C:\\Program Files\\Process Lasso\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA NvDLISR\\nvml.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python\\Launcher\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\sqlite-tools\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\Scripts\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\GnuWin32\\bin\\nvml.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\PHP\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\Scripts\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\nvml.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\VSCodium\\bin\\nvml.dll C:\\Users\\User\\.dotnet\\tools\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\yt-dlp\\nvml.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2024-10-22T19:17:47.706+03:00 level=DEBUG source=gpu.go:533 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2024-10-22T19:17:47.707+03:00 level=DEBUG source=gpu.go:562 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2024-10-22T19:17:47.721+03:00 level=DEBUG source=gpu.go:115 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2024-10-22T19:17:47.721+03:00 level=DEBUG source=gpu.go:505 msg="Searching for GPU library" name=nvcuda.dll
time=2024-10-22T19:17:47.721+03:00 level=DEBUG source=gpu.go:528 msg="gpu library search" globs="[C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin\\nvcuda.dll C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files\\Git\\bin\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files\\PowerShell\\7\\nvcuda.dll C:\\Program Files\\nodejs\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\Process Lasso\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA NvDLISR\\nvcuda.dll C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python\\Launcher\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\sqlite-tools\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\Scripts\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\GnuWin32\\bin\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\PHP\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\Scripts\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\nvcuda.dll C:\\Users\\User\\AppData\\Roaming\\npm\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\VSCodium\\bin\\nvcuda.dll C:\\Users\\User\\.dotnet\\tools\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\yt-dlp\\nvcuda.dll C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2024-10-22T19:17:47.721+03:00 level=DEBUG source=gpu.go:533 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2024-10-22T19:17:47.723+03:00 level=DEBUG source=gpu.go:562 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
time=2024-10-22T19:17:47.738+03:00 level=DEBUG source=gpu.go:129 msg="detected GPUs" count=1 library=C:\Windows\system32\nvcuda.dll
time=2024-10-22T19:17:47.851+03:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
time=2024-10-22T19:17:47.852+03:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e library=cuda variant=v12 compute=8.9 driver=12.6 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/10/22 - 19:17:58 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/10/22 - 19:17:58 | 200 |       505.9µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2024/10/22 - 19:18:04 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/10/22 - 19:18:04 | 200 |       8.314ms |       127.0.0.1 | POST     "/api/show"
time=2024-10-22T19:18:04.397+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:04.419+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:04.420+03:00 level=DEBUG source=sched.go:182 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x132d0e0 gpu_count=1
time=2024-10-22T19:18:04.432+03:00 level=DEBUG source=sched.go:225 msg="loading first model" model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
time=2024-10-22T19:18:04.433+03:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[20.3 GiB]"
time=2024-10-22T19:18:04.433+03:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e parallel=4 available=21831094272 required="18.3 GiB"
time=2024-10-22T19:18:04.433+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:04.451+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:04.451+03:00 level=INFO source=llama-server.go:72 msg="system memory" total="191.6 GiB" free="163.5 GiB" free_swap="164.6 GiB"
time=2024-10-22T19:18:04.451+03:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[20.3 GiB]"
time=2024-10-22T19:18:04.451+03:00 level=INFO source=memory.go:346 msg="offload to cuda" layers.requested=-1 layers.model=28 layers.offload=28 layers.split="" memory.available="[20.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.3 GiB" memory.required.partial="18.3 GiB" memory.required.kv="2.1 GiB" memory.required.allocations="[18.3 GiB]" memory.weights.total="17.2 GiB" memory.weights.repeating="17.0 GiB" memory.weights.nonrepeating="212.5 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="391.4 MiB"
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe
time=2024-10-22T19:18:04.452+03:00 level=DEBUG source=common.go:327 msg="availableServers : found" file=C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm\ollama_llama_server.exe
time=2024-10-22T19:18:04.458+03:00 level=INFO source=llama-server.go:355 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model C:\\Users\\User\\.ollama\\models\\blobs\\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f --ctx-size 8192 --batch-size 512 --embedding --n-gpu-layers 28 --verbose --threads 12 --no-mmap --parallel 4 --port 60158"
time=2024-10-22T19:18:04.458+03:00 level=DEBUG source=llama-server.go:372 msg=subprocess environment="[CUDA_ERROR_LEVEL=50  CUDA_HOME=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_PATH=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 CUDA_PATH_V12_6=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6 PATH=C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\dotnet\\;C:\\Program Files\\Git\\bin;C:\\Program Files\\Git\\cmd;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files\\nodejs\\;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\Process Lasso\\;;C:\\Program Files\\NVIDIA Corporation\\NVIDIA NvDLISR;C:\\Program Files (x86)\\Windows Kits\\10\\Windows Performance Toolkit\\;C:\\Users\\User\\AppData\\Local\\Programs\\Python\\Launcher\\;C:\\Users\\User\\AppData\\Local\\Programs\\sqlite-tools;C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\Scripts\\;C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\;C:\\Users\\User\\AppData\\Local\\Programs\\GnuWin32\\bin;C:\\Users\\User\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\User\\AppData\\Local\\Programs\\PHP;C:\\Users\\User\\AppData\\Local\\Programs\\Python311\\Scripts;C:\\Users\\User\\AppData\\Local\\Programs\\Python311;C:\\Users\\User\\AppData\\Roaming\\npm;C:\\Users\\User\\AppData\\Local\\Programs\\VSCodium\\bin;C:\\Users\\User\\.dotnet\\tools;C:\\Users\\User\\AppData\\Local\\Programs\\Ollama;C:\\Users\\User\\AppData\\Local\\Programs\\yt-dlp; CUDA_VISIBLE_DEVICES=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e]"
time=2024-10-22T19:18:04.462+03:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-10-22T19:18:04.462+03:00 level=INFO source=llama-server.go:534 msg="waiting for llama runner to start responding"
time=2024-10-22T19:18:04.464+03:00 level=INFO source=llama-server.go:568 msg="waiting for server to become available" status="llm server error"
time=2024-10-22T19:18:04.562+03:00 level=INFO source=runner.go:856 msg="starting go runner"
time=2024-10-22T19:18:04.577+03:00 level=DEBUG source=runner.go:857 msg="system info" cpu="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " threads=12
time=2024-10-22T19:18:04.578+03:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:60158"
llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = DeepSeek-Coder-V2-Lite-Instruct
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e", ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is d...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type q8_0:  269 tensors
llama_model_load: error loading model: error loading model vocabulary: wstring_convert::from_bytes
llama_load_model_from_file: failed to load model
llama_new_context_with_model: model cannot be NULL
Exception 0xc0000005 0x0 0x4221 0x8daa50
PC=0x8daa50
signal arrived during external code execution

runtime.cgocall(0x8529c0, 0xc000075d98)
	runtime/cgocall.go:157 +0x3e fp=0xc000075d70 sp=0xc000075d38 pc=0x5d5dde
github.com/ollama/ollama/llama._Cfunc_llama_add_bos_token(0x0)
	_cgo_gotypes.go:449 +0x52 fp=0xc000075d98 sp=0xc000075d70 pc=0x6d75f2
main.(*Server).loadModel.(*Model).AddBOSToken.func1(0x80000002000?)
	github.com/ollama/ollama/llama/llama.go:242 +0x3e fp=0xc000075dd8 sp=0xc000075d98 pc=0x85097e
github.com/ollama/ollama/llama.(*Model).AddBOSToken(...)
	github.com/ollama/ollama/llama/llama.go:242
main.(*Server).loadModel(0xc00015c120, {0x1c, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc00010a190, 0x0}, ...)
	github.com/ollama/ollama/llama/runner/runner.go:792 +0x219 fp=0xc000075f38 sp=0xc000075dd8 pc=0x850619
main.main.gowrap1()
	github.com/ollama/ollama/llama/runner/runner.go:889 +0x95 fp=0xc000075fe0 sp=0xc000075f38 pc=0x851d55
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000075fe8 sp=0xc000075fe0 pc=0x63e2e1
created by main.main in goroutine 1
	github.com/ollama/ollama/llama/runner/runner.go:889 +0xbd6

goroutine 1 gp=0xc000064000 m=nil [IO wait]:
runtime.gopark(0xe30e80?, 0xafed40?, 0xa0?, 0xc?, 0xc000120cd0?)
	runtime/proc.go:402 +0xce fp=0xc000139748 sp=0xc000139728 pc=0x60dfee
runtime.netpollblock(0x1e4?, 0x5d58e6?, 0x0?)
	runtime/netpoll.go:573 +0xf7 fp=0xc000139780 sp=0xc000139748 pc=0x604d97
time=2024-10-22T19:18:04.717+03:00 level=INFO source=llama-server.go:568 msg="waiting for server to become available" status="llm server loading model"
internal/poll.runtime_pollWait(0x130ea49a780, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc0001397a0 sp=0xc000139780 pc=0x6384c5
internal/poll.(*pollDesc).wait(0x5ec336?, 0xeb6ac0?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0001397c8 sp=0xc0001397a0 pc=0x68fc47
internal/poll.execIO(0xc000120ca0, 0xc000139868)
	internal/poll/fd_windows.go:175 +0xe6 fp=0xc000139838 sp=0xc0001397c8 pc=0x690586
internal/poll.(*FD).acceptOne(0xc000120c88, 0x1fc, {0xc000220000?, 0x211f80?, 0x0?}, 0x0?)
	internal/poll/fd_windows.go:944 +0x67 fp=0xc000139898 sp=0xc000139838 pc=0x692567
internal/poll.(*FD).Accept(0xc000120c88, 0xc000139a48)
	internal/poll/fd_windows.go:978 +0x1bc fp=0xc000139950 sp=0xc000139898 pc=0x69289c
net.(*netFD).accept(0xc000120c88)
	net/fd_windows.go:178 +0x54 fp=0xc000139a68 sp=0xc000139950 pc=0x6fa974
net.(*TCPListener).accept(0xc000116720)
	net/tcpsock_posix.go:159 +0x1e fp=0xc000139a90 sp=0xc000139a68 pc=0x70be9e
net.(*TCPListener).Accept(0xc000116720)
	net/tcpsock.go:327 +0x30 fp=0xc000139ac0 sp=0xc000139a90 pc=0x70b0d0
net/http.(*onceCloseListener).Accept(0xc000210000?)
	<autogenerated>:1 +0x24 fp=0xc000139ad8 sp=0xc000139ac0 pc=0x82fae4
net/http.(*Server).Serve(0xc000194000, {0xc1d5c0, 0xc000116720})
	net/http/server.go:3260 +0x33e fp=0xc000139c08 sp=0xc000139ad8 pc=0x8268fe
main.main()
	github.com/ollama/ollama/llama/runner/runner.go:914 +0x104c fp=0xc000139f50 sp=0xc000139c08 pc=0x851a0c
runtime.main()
	runtime/proc.go:271 +0x28b fp=0xc000139fe0 sp=0xc000139f50 pc=0x60dbeb
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000139fe8 sp=0xc000139fe0 pc=0x63e2e1

goroutine 2 gp=0xc000064700 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000067fa8 sp=0xc000067f88 pc=0x60dfee
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.forcegchelper()
	runtime/proc.go:326 +0xb8 fp=0xc000067fe0 sp=0xc000067fa8 pc=0x60de78
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000067fe8 sp=0xc000067fe0 pc=0x63e2e1
created by runtime.init.6 in goroutine 1
	runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000064a80 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000069f80 sp=0xc000069f60 pc=0x60dfee
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.bgsweep(0xc00001a070)
	runtime/mgcsweep.go:278 +0x94 fp=0xc000069fc8 sp=0xc000069f80 pc=0x5f7674
runtime.gcenable.gowrap1()
	runtime/mgc.go:203 +0x25 fp=0xc000069fe0 sp=0xc000069fc8 pc=0x5ec185
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000069fe8 sp=0xc000069fe0 pc=0x63e2e1
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000064c40 m=nil [GC scavenge wait]:
runtime.gopark(0xc00001a070?, 0xc18090?, 0x1?, 0x0?, 0xc000064c40?)
	runtime/proc.go:402 +0xce fp=0xc000079f78 sp=0xc000079f58 pc=0x60dfee
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.(*scavengerState).park(0xe304c0)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000079fa8 sp=0xc000079f78 pc=0x5f5069
runtime.bgscavenge(0xc00001a070)
	runtime/mgcscavenge.go:653 +0x3c fp=0xc000079fc8 sp=0xc000079fa8 pc=0x5f55fc
runtime.gcenable.gowrap2()
	runtime/mgc.go:204 +0x25 fp=0xc000079fe0 sp=0xc000079fc8 pc=0x5ec125
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000079fe8 sp=0xc000079fe0 pc=0x63e2e1
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0xa5

goroutine 18 gp=0xc0001061c0 m=nil [finalizer wait]:
runtime.gopark(0xc00006be48?, 0x5dfac5?, 0xa8?, 0x1?, 0xc000064000?)
	runtime/proc.go:402 +0xce fp=0xc00006be20 sp=0xc00006be00 pc=0x60dfee
runtime.runfinq()
	runtime/mfinal.go:194 +0x107 fp=0xc00006bfe0 sp=0xc00006be20 pc=0x5eb207
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc00006bfe8 sp=0xc00006bfe0 pc=0x63e2e1
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:164 +0x3d

goroutine 20 gp=0xc000106540 m=nil [semacquire]:
runtime.gopark(0x0?, 0x0?, 0x60?, 0xe0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000077d30 sp=0xc000077d10 pc=0x60dfee
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.semacquire1(0xc00015c188, 0x0, 0x1, 0x0, 0x12)
	runtime/sema.go:160 +0x232 fp=0xc000077d98 sp=0xc000077d30 pc=0x61f492
sync.runtime_Semacquire(0x0?)
	runtime/sema.go:62 +0x25 fp=0xc000077dd0 sp=0xc000077d98 pc=0x639ba5
sync.(*WaitGroup).Wait(0x0?)
	sync/waitgroup.go:116 +0x48 fp=0xc000077df8 sp=0xc000077dd0 pc=0x65bd08
main.(*Server).run(0xc00015c120, {0xc1dbf0, 0xc000192000})
	github.com/ollama/ollama/llama/runner/runner.go:324 +0x51 fp=0xc000077fb8 sp=0xc000077df8 pc=0x84d131
main.main.gowrap2()
	github.com/ollama/ollama/llama/runner/runner.go:894 +0x28 fp=0xc000077fe0 sp=0xc000077fb8 pc=0x851c88
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000077fe8 sp=0xc000077fe0 pc=0x63e2e1
created by main.main in goroutine 1
	github.com/ollama/ollama/llama/runner/runner.go:894 +0xcab

goroutine 34 gp=0xc000214000 m=nil [IO wait]:
runtime.gopark(0x0?, 0xc000206020?, 0xd0?, 0x60?, 0xc000206050?)
	runtime/proc.go:402 +0xce fp=0xc000027890 sp=0xc000027870 pc=0x60dfee
runtime.netpollblock(0x1f0?, 0x5d58e6?, 0x0?)
	runtime/netpoll.go:573 +0xf7 fp=0xc0000278c8 sp=0xc000027890 pc=0x604d97
internal/poll.runtime_pollWait(0x130ea49a688, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc0000278e8 sp=0xc0000278c8 pc=0x6384c5
internal/poll.(*pollDesc).wait(0xc000027930?, 0x68f9fc?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000027910 sp=0xc0000278e8 pc=0x68fc47
internal/poll.execIO(0xc000206020, 0xbcbe90)
	internal/poll/fd_windows.go:175 +0xe6 fp=0xc000027980 sp=0xc000027910 pc=0x690586
internal/poll.(*FD).Read(0xc000206008, {0xc000198000, 0x1000, 0x1000})
	internal/poll/fd_windows.go:436 +0x2b1 fp=0xc000027a28 sp=0xc000027980 pc=0x691231
net.(*netFD).Read(0xc000206008, {0xc000198000?, 0xc000027a98?, 0x690125?})
	net/fd_posix.go:55 +0x25 fp=0xc000027a70 sp=0xc000027a28 pc=0x6f9885
net.(*conn).Read(0xc00020e000, {0xc000198000?, 0x0?, 0xc000108e78?})
	net/net.go:185 +0x45 fp=0xc000027ab8 sp=0xc000027a70 pc=0x705445
net.(*TCPConn).Read(0xc000108e70?, {0xc000198000?, 0xc000206008?, 0xc000027af0?})
	<autogenerated>:1 +0x25 fp=0xc000027ae8 sp=0xc000027ab8 pc=0x70efc5
net/http.(*connReader).Read(0xc000108e70, {0xc000198000, 0x1000, 0x1000})
	net/http/server.go:789 +0x14b fp=0xc000027b38 sp=0xc000027ae8 pc=0x81c70b
bufio.(*Reader).fill(0xc00017e1e0)
	bufio/bufio.go:110 +0x103 fp=0xc000027b70 sp=0xc000027b38 pc=0x7d97e3
bufio.(*Reader).Peek(0xc00017e1e0, 0x4)
	bufio/bufio.go:148 +0x53 fp=0xc000027b90 sp=0xc000027b70 pc=0x7d9913
net/http.(*conn).serve(0xc000210000, {0xc1dbb8, 0xc000108de0})
	net/http/server.go:2079 +0x749 fp=0xc000027fb8 sp=0xc000027b90 pc=0x822469
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3290 +0x28 fp=0xc000027fe0 sp=0xc000027fb8 pc=0x826ce8
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000027fe8 sp=0xc000027fe0 pc=0x63e2e1
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3290 +0x4b4
rax     0xc000076000
rbx     0xc000075d98
rcx     0x40f8
rdx     0xc000075d28
rdi     0xc000076000
rsi     0xc000064fc0
rbp     0x17015ff6e0
rsp     0x17015ff678
r8      0xc000082008
r9      0x0
r10     0x130ea30a4c8
r11     0x0
r12     0xc00000a060
r13     0x0
r14     0xc000106380
r15     0x1ffffffffffffff
rip     0x8daa50
rflags  0x10202
cs      0x33
fs      0x53
gs      0x2b
time=2024-10-22T19:18:04.736+03:00 level=DEBUG source=llama-server.go:395 msg="llama runner terminated" error="exit status 2"
time=2024-10-22T19:18:04.978+03:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: error loading model: error loading model vocabulary: wstring_convert::from_bytes"
time=2024-10-22T19:18:04.978+03:00 level=DEBUG source=sched.go:459 msg="triggering expiration for failed load" model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
time=2024-10-22T19:18:04.978+03:00 level=DEBUG source=sched.go:361 msg="runner expired event received" modelPath=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
time=2024-10-22T19:18:04.978+03:00 level=DEBUG source=sched.go:376 msg="got lock to unload" modelPath=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
time=2024-10-22T19:18:04.978+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB"
[GIN] 2024/10/22 - 19:18:04 | 500 |    589.2269ms |       127.0.0.1 | POST     "/api/generate"
time=2024-10-22T19:18:04.994+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:04.995+03:00 level=DEBUG source=llama-server.go:1017 msg="stopping llama server"
time=2024-10-22T19:18:04.995+03:00 level=DEBUG source=sched.go:381 msg="runner released" modelPath=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
time=2024-10-22T19:18:05.259+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:05.275+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:05.507+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:05.523+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:05.754+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:05.770+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:06.003+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:06.018+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:06.252+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:06.267+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:06.502+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:06.518+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:06.750+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:06.765+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:06.996+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:07.013+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:07.246+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:07.261+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:07.495+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:07.510+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:07.745+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:07.761+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:08.009+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:08.026+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:08.258+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:08.273+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:08.506+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:08.521+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:08.755+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:08.771+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:09.003+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:09.019+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:09.252+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:09.267+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:09.500+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:09.516+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:09.749+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:09.765+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:09.997+03:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0191493 model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
time=2024-10-22T19:18:09.997+03:00 level=DEBUG source=sched.go:385 msg="sending an unloaded event" modelPath=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
time=2024-10-22T19:18:09.997+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:09.997+03:00 level=DEBUG source=sched.go:309 msg="ignoring unload event with no pending requests"
time=2024-10-22T19:18:10.014+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:10.260+03:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2817106 model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f
time=2024-10-22T19:18:10.260+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB"
time=2024-10-22T19:18:10.275+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB"
time=2024-10-22T19:18:10.496+03:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5176842 model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f

memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:06.750+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:06.765+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:06.996+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:07.013+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:07.246+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:07.261+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:07.495+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:07.510+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:07.745+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:07.761+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:08.009+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:08.026+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:08.258+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory 
data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:08.273+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:08.506+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:08.521+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:08.755+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.6 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:08.771+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:09.003+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:09.019+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:09.252+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:09.267+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:09.500+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:09.516+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:09.749+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:09.765+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA 
GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:09.997+03:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0191493 model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f time=2024-10-22T19:18:09.997+03:00 level=DEBUG source=sched.go:385 msg="sending an unloaded event" modelPath=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f time=2024-10-22T19:18:09.997+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.5 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:09.997+03:00 level=DEBUG source=sched.go:309 msg="ignoring unload event with no pending requests" time=2024-10-22T19:18:10.014+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:10.260+03:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2817106 model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f time=2024-10-22T19:18:10.260+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="191.6 GiB" before.free="163.5 GiB" before.free_swap="164.6 GiB" now.total="191.6 GiB" now.free="163.6 GiB" now.free_swap="164.6 GiB" time=2024-10-22T19:18:10.275+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4dbedfaa-2842-c28f-e99a-3e3c4d45775e name="NVIDIA GeForce RTX 4090" overhead="0 B" before.total="24.0 GiB" before.free="20.3 GiB" now.total="24.0 GiB" now.free="20.3 GiB" now.used="3.7 GiB" time=2024-10-22T19:18:10.496+03:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5176842 model=C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f ```
Author
Owner

@emzaedu commented on GitHub (Oct 22, 2024):

So I noticed that the log from 0.3.14 contains lines such as these:

llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000

llama_model_loader: - kv  37:               general.quantization_version u32              = 2

which are missing when I try to load via 0.4.0-rc3.

<!-- gh-comment-id:2429756046 --> @emzaedu commented on GitHub (Oct 22, 2024): So I noticed that the log from 0.3.14 contains lines such as these: ``` llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e... llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 100000 ``` ``` llama_model_loader: - kv 37: general.quantization_version u32 = 2 ``` which are missing when I try to load via 0.4.0-rc3.
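
One way to rule out log wrapping entirely is to dump the GGUF key/value metadata straight from the blob. A minimal sketch, assuming the `gguf` Python package and its `gguf-dump` entry point (flag names as in recent gguf-py releases); the blob path is the one from the server log above:

```console
$ pip install gguf
$ gguf-dump --no-tensors "C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f"
```

If `tokenizer.ggml.merges` and `tokenizer.ggml.token_type` show up in that dump, the metadata is intact in the downloaded file and the failure lies in how the runner parses it, not in the download.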
Author
Owner

@rick-github commented on GitHub (Oct 22, 2024):

They are there, scroll to the right and you will see Z 0�llama_model_loader: - kv 29: at the end of the line that starts with llama_model_loader: - kv 28:. It's a wrapping issue. 0.4.0 is switching over to the new Go runner and some of the formatting is inconsistent with the old runner in 0.3.14.

<!-- gh-comment-id:2429765021 --> @rick-github commented on GitHub (Oct 22, 2024): They are there, scroll to the right and you will see `Z 0�llama_model_loader: - kv 29:` at the end of the line that starts with `llama_model_loader: - kv 28:`. It's a wrapping issue. 0.4.0 is switching over to the new Go runner and some of the formatting is inconsistent with the old runner in 0.3.14.
Author
Owner

@emzaedu commented on GitHub (Oct 22, 2024):

They are there, scroll to the right and you will see Z 0�llama_model_loader: - kv 29: at the end of the line that starts with llama_model_loader: - kv 28:. It's a wrapping issue. 0.4.0 is switching over to the new Go runner and some of the formatting is inconsistent with the old runner in 0.3.14.

Yes, that was an oversight on my part, but in any case the model still works fine on 0.3.14 and fails to load on 0.4.0.

Do the provided logs at least clarify the situation?
Is there anything else I can do to provide more information?

<!-- gh-comment-id:2429773921 --> @emzaedu commented on GitHub (Oct 22, 2024): > They are there, scroll to the right and you will see `Z 0�llama_model_loader: - kv 29:` at the end of the line that starts with `llama_model_loader: - kv 28:`. It's a wrapping issue. 0.4.0 is switching over to the new Go runner and some of the formatting is inconsistent with the old runner in 0.3.14. Yes, that was an oversight on my part, but in any case the model still works fine on 0.3.14 and fails to load on 0.4.0. Do the provided logs at least clarify the situation? Is there anything else I can do to provide more information?
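
One extra data point that could narrow this down, sketched under the assumption that an upstream llama.cpp build (its `llama-cli` tool) is available on the same Windows machine: loading the identical blob directly would show whether the vocabulary failure is specific to Ollama's new runner build.

```console
# Hypothetical isolation test: point upstream llama.cpp at the same blob.
> llama-cli -m C:\Users\User\.ollama\models\blobs\sha256-373dcfc92e01372709b6164fc836f677a6280e25e9eac5c434c64223207bfc4f -p "hello" -n 16
```

If `llama-cli` loads the vocabulary cleanly on the same machine, the regression is in how the 0.4.0 runner is built or parses the tokenizer, rather than in the model file itself.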
Author
Owner

@rick-github commented on GitHub (Oct 22, 2024):

I'm afraid we'll have to leave this in the hands of the runner developers, and since replicating it seems problematic it may take some effort to fix. On the other hand, it may just be a difference between Windows and Linux. I don't have a Windows machine to test on, but the developers will, so perhaps replication and resolution will come soon.

<!-- gh-comment-id:2429799043 --> @rick-github commented on GitHub (Oct 22, 2024): I'm afraid we'll have to leave this in the hands of the runner developers, and since replicating it seems problematic it may take some effort to fix. On the other hand, it may just be a difference between Windows and Linux. I don't have a Windows machine to test on, but the developers will, so perhaps replication and resolution will come soon.
Author
Owner

@jessegross commented on GitHub (Oct 22, 2024):

I was able to reproduce it and confirmed that it affects setups with all of the following:

  • 0.4.0
  • Windows
  • Deepseek v2 models, which have multi-byte UTF-8 characters in the tokenizer regex (see the sketch below)
<!-- gh-comment-id:2430436077 --> @jessegross commented on GitHub (Oct 22, 2024): I was able to reproduce it and confirmed that it affects setups with all of the following: - 0.4.0 - Windows - Deepseek v2 models, which have multi-byte UTF-8 characters in the tokenizer regex (see the sketch below)
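
The wstring_convert::from_bytes message matches a known Windows pitfall: wchar_t is 16 bits there, so std::codecvt_utf8<wchar_t> only decodes code points up to U+FFFF, and a multi-byte UTF-8 sequence it cannot represent (or a mangled one) makes from_bytes throw std::range_error with exactly this message. A minimal C++ sketch of that failure mode, not Ollama's actual loading code:

```cpp
#include <codecvt>   // deprecated in C++17, but still where this error originates
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

int main() {
    // With 16-bit wchar_t (Windows), codecvt_utf8<wchar_t> targets UCS-2, so
    // anything it cannot represent or parse throws
    // std::range_error("wstring_convert::from_bytes").
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

    // U+1F600 is a 4-byte UTF-8 sequence outside the BMP: fails under UCS-2.
    // A truncated multi-byte sequence ("\xC3" alone) fails on any platform.
    for (const std::string& bytes : {std::string("\xF0\x9F\x98\x80"), std::string("\xC3")}) {
        try {
            std::wstring w = conv.from_bytes(bytes);
            std::wcout << L"decoded " << w.size() << L" code units\n";
        } catch (const std::range_error& e) {
            std::cerr << e.what() << '\n';  // prints: wstring_convert::from_bytes
        }
    }
    return 0;
}
```

On Linux, where wchar_t is 32 bits, the first input decodes fine, which would be consistent with the same model loading on Linux but not on Windows.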
Reference: github-starred/ollama#51157