[GH-ISSUE #9176] Can't get Ollama to run on Intel Arc B580, 'std::runtime_error' #5976

Closed
opened 2026-04-12 17:19:40 -05:00 by GiteaMirror · 12 comments
Owner

Originally created by @Ejo2001 on GitHub (Feb 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9176

What is the issue?

Hello!

I've been trying to run Ollama on my dual Intel Arc B580 setup, but I can't get it to work no matter what I try. I have followed these guides:

https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/ollama_quickstart.md

https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/bmg_quickstart.md

https://syslynx.net/llm-intel-b580-linux/

I keep getting different errors, but the most recent one I've run into is this:

terminate called after throwing an instance of 'std::runtime_error'
  what():  can not find preferred GPU platform
SIGABRT: abort
PC=0x7e33dfe969fc m=0 sigcode=18446744073709551610
signal arrived during cgo execution

This error shows up when following this guide: https://syslynx.net/llm-intel-b580-linux/
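As a sanity check (not part of the guide; `sycl-ls` ships with the Intel oneAPI Base Toolkit), it may be worth confirming that the SYCL runtime can enumerate the GPUs at all, since "can not find preferred GPU platform" usually means the device list came back empty:

```shell
# Check whether the SYCL runtime sees the Arc cards; an empty or missing
# listing is consistent with the "can not find preferred GPU platform" abort.
if command -v sycl-ls >/dev/null 2>&1; then
  sycl-ls   # both B580s should appear as Level Zero / OpenCL GPU devices
  status="found"
else
  echo "sycl-ls not on PATH; source /opt/intel/oneapi/setvars.sh first"
  status="missing"
fi
```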

I managed to reach the Ollama API after publishing port 11434 in the ipex/ollama podman command; however, it returns {"error":"POST predict: Post \"http://127.0.0.1:41959/completion\": EOF"} no matter what I do.
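For reference, the port change amounts to something like the following (the image name is a placeholder, and the device flags are an assumption based on the guides; substitute whatever the guide's actual podman command uses):

```shell
# Sketch of the podman invocation with the Ollama API port published.
# <ipex-llm-ollama-image> is a placeholder, not the guide's exact image.
image="<ipex-llm-ollama-image>"
set -- podman run -d --device /dev/dri \
       -p 11434:11434 -e OLLAMA_HOST=0.0.0.0:11434 "$image"
# Print the command rather than running it, since executing it needs
# the Arc GPUs (and the image) present:
printf '%s ' "$@"; echo
```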

I have tried running Ollama on Ubuntu 24.04.1 LTS, Ubuntu 24.10, and Windows 10, and I can't get it to work on any of them. I have tried multiple times with multiple reinstalls, but I keep running into problems. I have verified that my dual B580 setup works with LM Studio on Windows 10, so it's not a hardware issue.

My hardware setup:

CPU: Intel Core i7-9700K

GPU: 2x Intel Arc B580 12 GB

RAM: 32 GB DDR4 3200 MHz

PSU: Corsair GS700

Does anyone have any idea how to resolve this? Is it a bug, or am I just losing my mind? Will there be a native IPEX implementation in Ollama itself anytime soon?

Relevant log output

terminate called after throwing an instance of 'std::runtime_error'
  what():  can not find preferred GPU platform
SIGABRT: abort
PC=0x7e33dfe969fc m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1 gp=0xc0000061c0 m=0 mp=0x63fa60ba1480 [syscall]:
runtime.cgocall(0x63fa5fe58950, 0xc0000c3960)
        runtime/cgocall.go:167 +0x4b fp=0xc0000c3938 sp=0xc0000c3900 pc=0x63fa5f2b754b
ollama/llama/llamafile._Cfunc_llama_print_system_info()
        _cgo_gotypes.go:838 +0x4c fp=0xc0000c3960 sp=0xc0000c3938 pc=0x63fa5f67aaec
ollama/llama/llamafile.PrintSystemInfo()
        ollama/llama/llamafile/llama.go:70 +0x79 fp=0xc0000c39a8 sp=0xc0000c3960 pc=0x63fa5f67bb79
ollama/cmd.NewCLI()
        ollama/cmd/cmd.go:1427 +0xc08 fp=0xc0000c3f30 sp=0xc0000c39a8 pc=0x63fa5fe50f08
main.main()
        ollama/main.go:12 +0x13 fp=0xc0000c3f50 sp=0xc0000c3f30 pc=0x63fa5fe57d93
runtime.main()
        runtime/proc.go:272 +0x29d fp=0xc0000c3fe0 sp=0xc0000c3f50 pc=0x63fa5f288f5d
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000c3fe8 sp=0xc0000c3fe0 pc=0x63fa5f2c6021

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x63fa5f2bdc4e
runtime.goparkunlock(...)
        runtime/proc.go:430
runtime.forcegchelper()
        runtime/proc.go:337 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x63fa5f289298
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x63fa5f2c6021
created by runtime.init.7 in goroutine 1
        runtime/proc.go:325 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x63fa5f2bdc4e
runtime.goparkunlock(...)
        runtime/proc.go:430
runtime.bgsweep(0xc0000b0000)
        runtime/mgcsweep.go:317 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x63fa5f27393f
runtime.gcenable.gowrap1()
        runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x63fa5f267f85
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x63fa5f2c6021
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0x63fa60b9ed80?, 0x63fa60003070?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x63fa5f2bdc4e
runtime.goparkunlock(...)
        runtime/proc.go:430
runtime.(*scavengerState).park(0x63fa60b9ed80)
        runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x63fa5f271309
runtime.bgscavenge(0xc0000b0000)
        runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x63fa5f271899
runtime.gcenable.gowrap2()
        runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x63fa5f267f25
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x63fa5f2c6021
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:205 +0xa5

goroutine 18 gp=0xc000104700 m=nil [finalizer wait]:
runtime.gopark(0xc000084648?, 0x63fa5f25e485?, 0xb0?, 0x1?, 0xc0000061c0?)
        runtime/proc.go:424 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x63fa5f2bdc4e
runtime.runfinq()
        runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x63fa5f267007
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x63fa5f2c6021
created by runtime.createfing in goroutine 1
        runtime/mfinal.go:163 +0x3d

goroutine 19 gp=0xc000105880 m=nil [chan receive]:
runtime.gopark(0xc000080760?, 0x63fa5f399685?, 0x40?, 0xc8?, 0x63fa60419c00?)
        runtime/proc.go:424 +0xce fp=0xc000080718 sp=0xc0000806f8 pc=0x63fa5f2bdc4e
runtime.chanrecv(0xc0001122a0, 0x0, 0x1)
        runtime/chan.go:639 +0x41c fp=0xc000080790 sp=0xc000080718 pc=0x63fa5f25767c
runtime.chanrecv1(0x0?, 0x0?)
        runtime/chan.go:489 +0x12 fp=0xc0000807b8 sp=0xc000080790 pc=0x63fa5f257232
runtime.unique_runtime_registerUniqueMapCleanup.func1(...)
        runtime/mgc.go:1781
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
        runtime/mgc.go:1784 +0x2f fp=0xc0000807e0 sp=0xc0000807b8 pc=0x63fa5f26afef
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x63fa5f2c6021
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
        runtime/mgc.go:1779 +0x96

goroutine 20 gp=0xc0004681c0 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000080f38 sp=0xc000080f18 pc=0x63fa5f2bdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
        runtime/mgc.go:1412 +0xe9 fp=0xc000080fc8 sp=0xc000080f38 pc=0x63fa5f26a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1328 +0x25 fp=0xc000080fe0 sp=0xc000080fc8 pc=0x63fa5f26a1c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x63fa5f2c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1328 +0x105

goroutine 21 gp=0xc000468380 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000081738 sp=0xc000081718 pc=0x63fa5f2bdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
        runtime/mgc.go:1412 +0xe9 fp=0xc0000817c8 sp=0xc000081738 pc=0x63fa5f26a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1328 +0x25 fp=0xc0000817e0 sp=0xc0000817c8 pc=0x63fa5f26a1c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000817e8 sp=0xc0000817e0 pc=0x63fa5f2c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1328 +0x105

goroutine 22 gp=0xc000468540 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000081f38 sp=0xc000081f18 pc=0x63fa5f2bdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
        runtime/mgc.go:1412 +0xe9 fp=0xc000081fc8 sp=0xc000081f38 pc=0x63fa5f26a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1328 +0x25 fp=0xc000081fe0 sp=0xc000081fc8 pc=0x63fa5f26a1c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000081fe8 sp=0xc000081fe0 pc=0x63fa5f2c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1328 +0x105

goroutine 23 gp=0xc000468700 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000082738 sp=0xc000082718 pc=0x63fa5f2bdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
        runtime/mgc.go:1412 +0xe9 fp=0xc0000827c8 sp=0xc000082738 pc=0x63fa5f26a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1328 +0x25 fp=0xc0000827e0 sp=0xc0000827c8 pc=0x63fa5f26a1c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000827e8 sp=0xc0000827e0 pc=0x63fa5f2c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1328 +0x105

goroutine 24 gp=0xc0004688c0 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000082f38 sp=0xc000082f18 pc=0x63fa5f2bdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
        runtime/mgc.go:1412 +0xe9 fp=0xc000082fc8 sp=0xc000082f38 pc=0x63fa5f26a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1328 +0x25 fp=0xc000082fe0 sp=0xc000082fc8 pc=0x63fa5f26a1c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000082fe8 sp=0xc000082fe0 pc=0x63fa5f2c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1328 +0x105

goroutine 25 gp=0xc000468a80 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000083738 sp=0xc000083718 pc=0x63fa5f2bdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
        runtime/mgc.go:1412 +0xe9 fp=0xc0000837c8 sp=0xc000083738 pc=0x63fa5f26a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1328 +0x25 fp=0xc0000837e0 sp=0xc0000837c8 pc=0x63fa5f26a1c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000837e8 sp=0xc0000837e0 pc=0x63fa5f2c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1328 +0x105

goroutine 26 gp=0xc000468c40 m=nil [GC worker (idle)]:
runtime.gopark(0x1c5271e40da?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000083f38 sp=0xc000083f18 pc=0x63fa5f2bdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
        runtime/mgc.go:1412 +0xe9 fp=0xc000083fc8 sp=0xc000083f38 pc=0x63fa5f26a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1328 +0x25 fp=0xc000083fe0 sp=0xc000083fc8 pc=0x63fa5f26a1c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000083fe8 sp=0xc000083fe0 pc=0x63fa5f2c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1328 +0x105

goroutine 27 gp=0xc000468e00 m=nil [GC worker (idle)]:
runtime.gopark(0x1c5271ee646?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:424 +0xce fp=0xc000484738 sp=0xc000484718 pc=0x63fa5f2bdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
        runtime/mgc.go:1412 +0xe9 fp=0xc0004847c8 sp=0xc000484738 pc=0x63fa5f26a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1328 +0x25 fp=0xc0004847e0 sp=0xc0004847c8 pc=0x63fa5f26a1c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0004847e8 sp=0xc0004847e0 pc=0x63fa5f2c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1328 +0x105

OS

Linux

GPU

Intel

CPU

Intel

Ollama version

version 0.5.4-ipexllm-20250217

EDIT

If I run this on my server:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "prompt": "What is 9 + 2?",
  "stream": false
}'

Then I get this response:

{"model":"llama3.2:3b","created_at":"2025-02-17T16:10:24.85482232Z","message":{"role":"assistant","content":""},"done_reason":"load","done":true}
GiteaMirror added the bug label 2026-04-12 17:19:40 -05:00

@rick-github commented on GitHub (Feb 17, 2025):

Your edit indicates that you have a working system, you've just got the format of the request wrong.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "What is 9 + 2?",
  "stream": false
}'

or

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages":[{"role":"user","content": "What is 9 + 2?"}],
  "stream": false
}'
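The difference is easiest to see with the body split out; a small sketch of the same chat request (nothing new here, just the request body restructured for inspection):

```shell
# /api/chat expects a `messages` array; the `prompt` field belongs
# to /api/generate, which is why the chat call only loaded the model.
body='{
  "model": "llama3.2:3b",
  "messages": [{"role": "user", "content": "What is 9 + 2?"}],
  "stream": false
}'
echo "$body"
# With the server up: curl http://localhost:11434/api/chat -d "$body"
```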

@Ejo2001 commented on GitHub (Feb 17, 2025):

@rick-github You're right, my bad 😅

When I use the one you sent me, I get this:

{"error":"POST predict: Post \"http://127.0.0.1:39041/completion\": EOF"}


@rick-github commented on GitHub (Feb 17, 2025):

Please post a full log, earlier parts of the log will include information about device detection, etc.


@Ejo2001 commented on GitHub (Feb 17, 2025):

@rick-github It's huuuuuge

ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is: 

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJd9w8VDZHSpRWILi74iB1v7MDHSyZtXdFau2kzQL/1p

2025/02/17 23:53:47 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:localhost,127.0.0.1]"
time=2025-02-17T23:53:47.328+08:00 level=INFO source=images.go:757 msg="total blobs: 6"
time=2025-02-17T23:53:47.328+08:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-17T23:53:47.328+08:00 level=INFO source=routes.go:1310 msg="Listening on [::]:11434 (version 0.5.4-ipexllm-20250217)"
time=2025-02-17T23:53:47.328+08:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners=[ipex_llm]
[GIN] 2025/02/17 - 23:54:16 | 200 |    7.261733ms |       10.89.0.1 | GET      "/api/tags"
[GIN] 2025/02/17 - 23:54:16 | 200 |     125.667µs |       10.89.0.1 | GET      "/api/version"
[GIN] 2025/02/17 - 23:54:25 | 404 |    1.632384ms |       10.89.0.4 | POST     "/api/generate"
time=2025-02-17T23:54:35.903+08:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-02-17T23:54:35.935+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="29.0 GiB" free_swap="8.0 GiB"
time=2025-02-17T23:54:35.936+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[29.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB"
time=2025-02-17T23:54:35.936+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 41795"
time=2025-02-17T23:54:35.937+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-17T23:54:35.937+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-02-17T23:54:35.938+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
time=2025-02-17T23:54:36.074+08:00 level=INFO source=runner.go:967 msg="starting go runner"
time=2025-02-17T23:54:36.075+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
time=2025-02-17T23:54:36.075+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:41795"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2025-02-17T23:54:36.190+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        SYCL0 model buffer size =   852.89 MiB
llm_load_tensors:        SYCL1 model buffer size =  1065.46 MiB
llm_load_tensors:          CPU model buffer size =   308.23 MiB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
| 1| [level_zero:gpu:1]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
llama_kv_cache_init:      SYCL0 KV buffer size =   960.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      SYCL0 compute buffer size =   202.01 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =   408.52 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =   134.02 MiB
llama_new_context_with_model: graph nodes  = 790
llama_new_context_with_model: graph splits = 3
time=2025-02-17T23:54:43.679+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-02-17T23:54:43.720+08:00 level=INFO source=server.go:610 msg="llama runner started in 7.78 seconds"
SIGILL: illegal instruction
PC=0x74136c80bc2f m=4 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xf3 0xf 0xc7 0xf8 0x25 0xff 0x3 0x0 0x0 0x48 0x8b 0xd 0xe1 0xc2 0x2a 0x0

goroutine 39 gp=0xc0001fb180 m=4 mp=0xc00008b508 [syscall]:
runtime.cgocall(0x601e0fa584e0, 0xc000159b90)
	runtime/cgocall.go:167 +0x4b fp=0xc000159b68 sp=0xc000159b30 pc=0x601e0eeb754b
ollama/llama/llamafile._Cfunc_llama_decode(0x7412ebc57ee0, {0x1f, 0x7412e801bf90, 0x0, 0x0, 0x7412e801c7a0, 0x7412e801cfb0, 0x7412e801d7c0, 0x7412e83aec30})
	_cgo_gotypes.go:558 +0x4f fp=0xc000159b90 sp=0xc000159b68 pc=0x601e0f27996f
ollama/llama/llamafile.(*Context).Decode.func1(0x601e0f2886eb?, 0x7412ebc57ee0?)
	ollama/llama/llamafile/llama.go:143 +0xf5 fp=0xc000159c80 sp=0xc000159b90 pc=0x601e0f27c595
ollama/llama/llamafile.(*Context).Decode(0xc000080d70?, 0x0?)
	ollama/llama/llamafile/llama.go:143 +0x13 fp=0xc000159cc8 sp=0xc000159c80 pc=0x601e0f27c413
ollama/llama/runner.(*Server).processBatch(0xc0001a75f0, 0xc0003fc9c0, 0xc000080f20)
	ollama/llama/runner/runner.go:434 +0x23f fp=0xc000159ee0 sp=0xc000159cc8 pc=0x601e0f2873bf
ollama/llama/runner.(*Server).run(0xc0001a75f0, {0x601e100089c0, 0xc000529680})
	ollama/llama/runner/runner.go:342 +0x1d5 fp=0xc000159fb8 sp=0xc000159ee0 pc=0x601e0f286df5
ollama/llama/runner.Execute.gowrap2()
	ollama/llama/runner/runner.go:1006 +0x28 fp=0xc000159fe0 sp=0xc000159fb8 pc=0x601e0f28c068
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000159fe8 sp=0xc000159fe0 pc=0x601e0eec6021
created by ollama/llama/runner.Execute in goroutine 1
	ollama/llama/runner/runner.go:1006 +0xde5

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000127560 sp=0xc000127540 pc=0x601e0eebdc4e
runtime.netpollblock(0x4adf80?, 0xee54a66?, 0x1e?)
	runtime/netpoll.go:575 +0xf7 fp=0xc000127598 sp=0xc000127560 pc=0x601e0ee818b7
internal/poll.runtime_pollWait(0x74136e335680, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc0001275b8 sp=0xc000127598 pc=0x601e0eebcf45
internal/poll.(*pollDesc).wait(0xc000506a00?, 0x2c?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0001275e0 sp=0xc0001275b8 pc=0x601e0ef44567
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000506a00)
	internal/poll/fd_unix.go:620 +0x295 fp=0xc000127688 sp=0xc0001275e0 pc=0x601e0ef49935
net.(*netFD).accept(0xc000506a00)
	net/fd_unix.go:172 +0x29 fp=0xc000127740 sp=0xc000127688 pc=0x601e0efb2009
net.(*TCPListener).accept(0xc000535e00)
	net/tcpsock_posix.go:159 +0x1e fp=0xc000127790 sp=0xc000127740 pc=0x601e0efc7c7e
net.(*TCPListener).Accept(0xc000535e00)
	net/tcpsock.go:372 +0x30 fp=0xc0001277c0 sp=0xc000127790 pc=0x601e0efc6b30
net/http.(*onceCloseListener).Accept(0xc0004ac000?)
	<autogenerated>:1 +0x24 fp=0xc0001277d8 sp=0xc0001277c0 pc=0x601e0f240284
net/http.(*Server).Serve(0xc00053af00, {0x601e10006700, 0xc000535e00})
	net/http/server.go:3330 +0x30c fp=0xc000127908 sp=0xc0001277d8 pc=0x601e0f21820c
ollama/llama/runner.Execute({0xc000036130?, 0x0?, 0x0?})
	ollama/llama/runner/runner.go:1027 +0x11a9 fp=0xc000127ca8 sp=0xc000127908 pc=0x601e0f28bd49
ollama/cmd.NewCLI.func2(0xc0001cf200?, {0x601e0fa5cf9d?, 0x4?, 0x601e0fa5cfa1?})
	ollama/cmd/cmd.go:1430 +0x45 fp=0xc000127cd0 sp=0xc000127ca8 pc=0x601e0fa57765
github.com/spf13/cobra.(*Command).execute(0xc000130908, {0xc00053a870, 0xf, 0xf})
	github.com/spf13/cobra@v1.8.1/command.go:985 +0xaaa fp=0xc000127e58 sp=0xc000127cd0 pc=0x601e0f04b3ea
github.com/spf13/cobra.(*Command).ExecuteC(0xc00012e308)
	github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff fp=0xc000127f30 sp=0xc000127e58 pc=0x601e0f04bcbf
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.8.1/command.go:1041
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/cobra@v1.8.1/command.go:1034
main.main()
	ollama/main.go:12 +0x4d fp=0xc000127f50 sp=0xc000127f30 pc=0x601e0fa57dcd
runtime.main()
	runtime/proc.go:272 +0x29d fp=0xc000127fe0 sp=0xc000127f50 pc=0x601e0ee88f5d
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000127fe8 sp=0xc000127fe0 pc=0x601e0eec6021

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x601e0eebdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.forcegchelper()
	runtime/proc.go:337 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x601e0ee89298
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x601e0eec6021
created by runtime.init.7 in goroutine 1
	runtime/proc.go:325 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x601e0eebdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.bgsweep(0xc0000b2000)
	runtime/mgcsweep.go:317 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x601e0ee7393f
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x601e0ee67f85
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x601e0eec6021
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x601e0fc03070?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x601e0eebdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.(*scavengerState).park(0x601e1079ed80)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x601e0ee71309
runtime.bgscavenge(0xc0000b2000)
	runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x601e0ee71899
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x601e0ee67f25
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x601e0eec6021
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
runtime.gopark(0xc000084648?, 0x601e0ee5e485?, 0xb0?, 0x1?, 0xc0000061c0?)
	runtime/proc.go:424 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x601e0eebdc4e
runtime.runfinq()
	runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x601e0ee67007
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x601e0eec6021
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:163 +0x3d

goroutine 6 gp=0xc0001fae00 m=nil [chan receive]:
runtime.gopark(0xc000086760?, 0x601e0ef99685?, 0x40?, 0xe8?, 0x601e10019c00?)
	runtime/proc.go:424 +0xce fp=0xc000086718 sp=0xc0000866f8 pc=0x601e0eebdc4e
runtime.chanrecv(0xc00004e310, 0x0, 0x1)
	runtime/chan.go:639 +0x41c fp=0xc000086790 sp=0xc000086718 pc=0x601e0ee5767c
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:489 +0x12 fp=0xc0000867b8 sp=0xc000086790 pc=0x601e0ee57232
runtime.unique_runtime_registerUniqueMapCleanup.func1(...)
	runtime/mgc.go:1781
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1784 +0x2f fp=0xc0000867e0 sp=0xc0000867b8 pc=0x601e0ee6afef
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x601e0eec6021
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1779 +0x96

goroutine 7 gp=0xc0001fb880 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000086f38 sp=0xc000086f18 pc=0x601e0eebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc000086fc8 sp=0xc000086f38 pc=0x601e0ee6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000086fe0 sp=0xc000086fc8 pc=0x601e0ee6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x601e0eec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 18 gp=0xc000504000 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000080738 sp=0xc000080718 pc=0x601e0eebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc0000807c8 sp=0xc000080738 pc=0x601e0ee6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0000807e0 sp=0xc0000807c8 pc=0x601e0ee6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x601e0eec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 34 gp=0xc000104380 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00011c738 sp=0xc00011c718 pc=0x601e0eebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc00011c7c8 sp=0xc00011c738 pc=0x601e0ee6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00011c7e0 sp=0xc00011c7c8 pc=0x601e0ee6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00011c7e8 sp=0xc00011c7e0 pc=0x601e0eec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 8 gp=0xc0001fba40 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000087738 sp=0xc000087718 pc=0x601e0eebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc0000877c8 sp=0xc000087738 pc=0x601e0ee6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0000877e0 sp=0xc0000877c8 pc=0x601e0ee6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000877e8 sp=0xc0000877e0 pc=0x601e0eec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 35 gp=0xc000104540 m=nil [GC worker (idle)]:
runtime.gopark(0x18c3012f41?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00011cf38 sp=0xc00011cf18 pc=0x601e0eebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc00011cfc8 sp=0xc00011cf38 pc=0x601e0ee6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00011cfe0 sp=0xc00011cfc8 pc=0x601e0ee6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00011cfe8 sp=0xc00011cfe0 pc=0x601e0eec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 36 gp=0xc000104700 m=nil [GC worker (idle)]:
runtime.gopark(0x18c301349c?, 0x3?, 0x77?, 0xf7?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00011d738 sp=0xc00011d718 pc=0x601e0eebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc00011d7c8 sp=0xc00011d738 pc=0x601e0ee6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00011d7e0 sp=0xc00011d7c8 pc=0x601e0ee6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00011d7e8 sp=0xc00011d7e0 pc=0x601e0eec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 37 gp=0xc0001048c0 m=nil [GC worker (idle)]:
runtime.gopark(0x18c300b8f7?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00011df38 sp=0xc00011df18 pc=0x601e0eebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc00011dfc8 sp=0xc00011df38 pc=0x601e0ee6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00011dfe0 sp=0xc00011dfc8 pc=0x601e0ee6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00011dfe8 sp=0xc00011dfe0 pc=0x601e0eec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 9 gp=0xc0001fbc00 m=nil [GC worker (idle)]:
runtime.gopark(0x18c30154d9?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000087f38 sp=0xc000087f18 pc=0x601e0eebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc000087fc8 sp=0xc000087f38 pc=0x601e0ee6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000087fe0 sp=0xc000087fc8 pc=0x601e0ee6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000087fe8 sp=0xc000087fe0 pc=0x601e0eec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 10 gp=0xc000104a80 m=nil [select]:
runtime.gopark(0xc00005ba68?, 0x2?, 0xce?, 0x36?, 0xc00005b834?)
	runtime/proc.go:424 +0xce fp=0xc00005b650 sp=0xc00005b630 pc=0x601e0eebdc4e
runtime.selectgo(0xc00005ba68, 0xc00005b830, 0x1f?, 0x0, 0x1?, 0x1)
	runtime/select.go:335 +0x7a5 fp=0xc00005b778 sp=0xc00005b650 pc=0x601e0ee9af45
ollama/llama/runner.(*Server).completion(0xc0001a75f0, {0x601e10006910, 0xc000632ee0}, 0xc00033d180)
	ollama/llama/runner/runner.go:696 +0xab6 fp=0xc00005bac0 sp=0xc00005b778 pc=0x601e0f289236
ollama/llama/runner.(*Server).completion-fm({0x601e10006910?, 0xc000632ee0?}, 0x601e0f221fe7?)
	<autogenerated>:1 +0x36 fp=0xc00005baf0 sp=0xc00005bac0 pc=0x601e0f28c916
net/http.HandlerFunc.ServeHTTP(0xc000541340?, {0x601e10006910?, 0xc000632ee0?}, 0x0?)
	net/http/server.go:2220 +0x29 fp=0xc00005bb18 sp=0xc00005baf0 pc=0x601e0f214809
net/http.(*ServeMux).ServeHTTP(0x601e0ee5e485?, {0x601e10006910, 0xc000632ee0}, 0xc00033d180)
	net/http/server.go:2747 +0x1ca fp=0xc00005bb68 sp=0xc00005bb18 pc=0x601e0f21670a
net/http.serverHandler.ServeHTTP({0x601e10003510?}, {0x601e10006910?, 0xc000632ee0?}, 0x6?)
	net/http/server.go:3210 +0x8e fp=0xc00005bb98 sp=0xc00005bb68 pc=0x601e0f233c6e
net/http.(*conn).serve(0xc0004ac000, {0x601e10008988, 0xc00060af30})
	net/http/server.go:2092 +0x5d0 fp=0xc00005bfb8 sp=0xc00005bb98 pc=0x601e0f2131b0
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3360 +0x28 fp=0xc00005bfe0 sp=0xc00005bfb8 pc=0x601e0f218608
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x601e0eec6021
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3360 +0x485

goroutine 31 gp=0xc000584a80 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
	runtime/proc.go:424 +0xce fp=0xc00011bda8 sp=0xc00011bd88 pc=0x601e0eebdc4e
runtime.netpollblock(0x601e0eee0e78?, 0xee54a66?, 0x1e?)
	runtime/netpoll.go:575 +0xf7 fp=0xc00011bde0 sp=0xc00011bda8 pc=0x601e0ee818b7
internal/poll.runtime_pollWait(0x74136e335568, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc00011be00 sp=0xc00011bde0 pc=0x601e0eebcf45
internal/poll.(*pollDesc).wait(0xc000624500?, 0xc00060afa1?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00011be28 sp=0xc00011be00 pc=0x601e0ef44567
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000624500, {0xc00060afa1, 0x1, 0x1})
	internal/poll/fd_unix.go:165 +0x27a fp=0xc00011bec0 sp=0xc00011be28 pc=0x601e0ef4585a
net.(*netFD).Read(0xc000624500, {0xc00060afa1?, 0x0?, 0x0?})
	net/fd_posix.go:55 +0x25 fp=0xc00011bf08 sp=0xc00011bec0 pc=0x601e0efb0045
net.(*conn).Read(0xc000520030, {0xc00060afa1?, 0x0?, 0x0?})
	net/net.go:189 +0x45 fp=0xc00011bf50 sp=0xc00011bf08 pc=0x601e0efbe645
net.(*TCPConn).Read(0x0?, {0xc00060afa1?, 0x0?, 0x0?})
	<autogenerated>:1 +0x25 fp=0xc00011bf80 sp=0xc00011bf50 pc=0x601e0efd1845
net/http.(*connReader).backgroundRead(0xc00060af90)
	net/http/server.go:690 +0x37 fp=0xc00011bfc8 sp=0xc00011bf80 pc=0x601e0f20db37
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:686 +0x25 fp=0xc00011bfe0 sp=0xc00011bfc8 pc=0x601e0f20da65
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00011bfe8 sp=0xc00011bfe0 pc=0x601e0eec6021
created by net/http.(*connReader).startBackgroundRead in goroutine 10
	net/http/server.go:686 +0xb6

rax    0x0
rbx    0x0
rcx    0x7412f80063c0
rdx    0x741307dfebf0
rdi    0x6
rsi    0x601e2f6f7060
rbp    0x741307dfe960
rsp    0x741307dfe3f0
r8     0x7412f8006540
r9     0x0
r10    0x74136e8f8f28
r11    0x7412f8006540
r12    0x7412f8006540
r13    0x7412e80cd298
r14    0x7412f80063c0
r15    0x7412f8006540
rip    0x74136c80bc2f
rflags 0x10202
cs     0x33
fs     0x0
gs     0x0
[GIN] 2025/02/17 - 23:54:43 | 500 |  7.913867252s |       10.89.0.4 | POST     "/api/generate"
time=2025-02-17T23:54:57.191+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="28.6 GiB" free_swap="8.0 GiB"
time=2025-02-17T23:54:57.191+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[28.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB"
time=2025-02-17T23:54:57.192+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 41959"
time=2025-02-17T23:54:57.192+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-17T23:54:57.192+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-02-17T23:54:57.192+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
time=2025-02-17T23:54:57.326+08:00 level=INFO source=runner.go:967 msg="starting go runner"
time=2025-02-17T23:54:57.326+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
time=2025-02-17T23:54:57.326+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:41959"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2025-02-17T23:54:57.443+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        SYCL0 model buffer size =   852.89 MiB
llm_load_tensors:        SYCL1 model buffer size =  1065.46 MiB
llm_load_tensors:          CPU model buffer size =   308.23 MiB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
| 1| [level_zero:gpu:1]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
llama_kv_cache_init:      SYCL0 KV buffer size =   960.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      SYCL0 compute buffer size =   202.01 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =   408.52 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =   134.02 MiB
llama_new_context_with_model: graph nodes  = 790
llama_new_context_with_model: graph splits = 3
time=2025-02-17T23:54:59.613+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-02-17T23:54:59.702+08:00 level=INFO source=server.go:610 msg="llama runner started in 2.51 seconds"
SIGILL: illegal instruction
PC=0x76bc5f60bc2f m=3 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xf3 0xf 0xc7 0xf8 0x25 0xff 0x3 0x0 0x0 0x48 0x8b 0xd 0xe1 0xc2 0x2a 0x0

goroutine 8 gp=0xc000484c40 m=3 mp=0xc00008ae08 [syscall]:
runtime.cgocall(0x5d46b18584e0, 0xc000097b90)
	runtime/cgocall.go:167 +0x4b fp=0xc000097b68 sp=0xc000097b30 pc=0x5d46b0cb754b
ollama/llama/llamafile._Cfunc_llama_decode(0x76bbefc576b0, {0x21, 0x76bbec020350, 0x0, 0x0, 0x76bbec01b8f0, 0x76bbec01c100, 0x76bbec01c910, 0x76bbec0813f0})
	_cgo_gotypes.go:558 +0x4f fp=0xc000097b90 sp=0xc000097b68 pc=0x5d46b107996f
ollama/llama/llamafile.(*Context).Decode.func1(0x5d46b10886eb?, 0x76bbefc576b0?)
	ollama/llama/llamafile/llama.go:143 +0xf5 fp=0xc000097c80 sp=0xc000097b90 pc=0x5d46b107c595
ollama/llama/llamafile.(*Context).Decode(0xc00048d570?, 0x0?)
	ollama/llama/llamafile/llama.go:143 +0x13 fp=0xc000097cc8 sp=0xc000097c80 pc=0x5d46b107c413
ollama/llama/runner.(*Server).processBatch(0xc00019d560, 0xc00001e060, 0xc00048d720)
	ollama/llama/runner/runner.go:434 +0x23f fp=0xc000097ee0 sp=0xc000097cc8 pc=0x5d46b10873bf
ollama/llama/runner.(*Server).run(0xc00019d560, {0x5d46b1e089c0, 0xc0005227d0})
	ollama/llama/runner/runner.go:342 +0x1d5 fp=0xc000097fb8 sp=0xc000097ee0 pc=0x5d46b1086df5
ollama/llama/runner.Execute.gowrap2()
	ollama/llama/runner/runner.go:1006 +0x28 fp=0xc000097fe0 sp=0xc000097fb8 pc=0x5d46b108c068
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000097fe8 sp=0xc000097fe0 pc=0x5d46b0cc6021
created by ollama/llama/runner.Execute in goroutine 1
	ollama/llama/runner/runner.go:1006 +0xde5

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000507560 sp=0xc000507540 pc=0x5d46b0cbdc4e
runtime.netpollblock(0x4e5f80?, 0xb0c54a66?, 0x46?)
	runtime/netpoll.go:575 +0xf7 fp=0xc000507598 sp=0xc000507560 pc=0x5d46b0c818b7
internal/poll.runtime_pollWait(0x76bc60dc8df0, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc0005075b8 sp=0xc000507598 pc=0x5d46b0cbcf45
internal/poll.(*pollDesc).wait(0xc000604700?, 0x2c?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0005075e0 sp=0xc0005075b8 pc=0x5d46b0d44567
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000604700)
	internal/poll/fd_unix.go:620 +0x295 fp=0xc000507688 sp=0xc0005075e0 pc=0x5d46b0d49935
net.(*netFD).accept(0xc000604700)
	net/fd_unix.go:172 +0x29 fp=0xc000507740 sp=0xc000507688 pc=0x5d46b0db2009
net.(*TCPListener).accept(0xc000366740)
	net/tcpsock_posix.go:159 +0x1e fp=0xc000507790 sp=0xc000507740 pc=0x5d46b0dc7c7e
net.(*TCPListener).Accept(0xc000366740)
	net/tcpsock.go:372 +0x30 fp=0xc0005077c0 sp=0xc000507790 pc=0x5d46b0dc6b30
net/http.(*onceCloseListener).Accept(0xc0004e4000?)
	<autogenerated>:1 +0x24 fp=0xc0005077d8 sp=0xc0005077c0 pc=0x5d46b1040284
net/http.(*Server).Serve(0xc0004cd0e0, {0x5d46b1e06700, 0xc000366740})
	net/http/server.go:3330 +0x30c fp=0xc000507908 sp=0xc0005077d8 pc=0x5d46b101820c
ollama/llama/runner.Execute({0xc000136010?, 0x0?, 0x0?})
	ollama/llama/runner/runner.go:1027 +0x11a9 fp=0xc000507ca8 sp=0xc000507908 pc=0x5d46b108bd49
ollama/cmd.NewCLI.func2(0xc0001c9400?, {0x5d46b185cf9d?, 0x4?, 0x5d46b185cfa1?})
	ollama/cmd/cmd.go:1430 +0x45 fp=0xc000507cd0 sp=0xc000507ca8 pc=0x5d46b1857765
github.com/spf13/cobra.(*Command).execute(0xc0000ca908, {0xc0004cca50, 0xf, 0xf})
	github.com/spf13/cobra@v1.8.1/command.go:985 +0xaaa fp=0xc000507e58 sp=0xc000507cd0 pc=0x5d46b0e4b3ea
github.com/spf13/cobra.(*Command).ExecuteC(0xc0004c8f08)
	github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff fp=0xc000507f30 sp=0xc000507e58 pc=0x5d46b0e4bcbf
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.8.1/command.go:1041
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/cobra@v1.8.1/command.go:1034
main.main()
	ollama/main.go:12 +0x4d fp=0xc000507f50 sp=0xc000507f30 pc=0x5d46b1857dcd
runtime.main()
	runtime/proc.go:272 +0x29d fp=0xc000507fe0 sp=0xc000507f50 pc=0x5d46b0c88f5d
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000507fe8 sp=0xc000507fe0 pc=0x5d46b0cc6021

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x5d46b0cbdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.forcegchelper()
	runtime/proc.go:337 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x5d46b0c89298
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x5d46b0cc6021
created by runtime.init.7 in goroutine 1
	runtime/proc.go:325 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x5d46b0cbdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.bgsweep(0xc0000b2000)
	runtime/mgcsweep.go:317 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x5d46b0c7393f
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x5d46b0c67f85
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x5d46b0cc6021
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x5d46b1a03070?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x5d46b0cbdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.(*scavengerState).park(0x5d46b259ed80)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x5d46b0c71309
runtime.bgscavenge(0xc0000b2000)
	runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x5d46b0c71899
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x5d46b0c67f25
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x5d46b0cc6021
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 18 gp=0xc000104700 m=nil [finalizer wait]:
runtime.gopark(0xc000084648?, 0x5d46b0c5e485?, 0xb0?, 0x1?, 0xc0000061c0?)
	runtime/proc.go:424 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x5d46b0cbdc4e
runtime.runfinq()
	runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x5d46b0c67007
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x5d46b0cc6021
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:163 +0x3d

goroutine 19 gp=0xc000105880 m=nil [chan receive]:
runtime.gopark(0xc000080760?, 0x5d46b0d99685?, 0x40?, 0xc8?, 0x5d46b1e19c00?)
	runtime/proc.go:424 +0xce fp=0xc000080718 sp=0xc0000806f8 pc=0x5d46b0cbdc4e
runtime.chanrecv(0xc0001122a0, 0x0, 0x1)
	runtime/chan.go:639 +0x41c fp=0xc000080790 sp=0xc000080718 pc=0x5d46b0c5767c
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:489 +0x12 fp=0xc0000807b8 sp=0xc000080790 pc=0x5d46b0c57232
runtime.unique_runtime_registerUniqueMapCleanup.func1(...)
	runtime/mgc.go:1781
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1784 +0x2f fp=0xc0000807e0 sp=0xc0000807b8 pc=0x5d46b0c6afef
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x5d46b0cc6021
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1779 +0x96

goroutine 20 gp=0xc0004701c0 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000080f38 sp=0xc000080f18 pc=0x5d46b0cbdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
	runtime/mgc.go:1412 +0xe9 fp=0xc000080fc8 sp=0xc000080f38 pc=0x5d46b0c6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000080fe0 sp=0xc000080fc8 pc=0x5d46b0c6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x5d46b0cc6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 34 gp=0xc000484000 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00048a738 sp=0xc00048a718 pc=0x5d46b0cbdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
	runtime/mgc.go:1412 +0xe9 fp=0xc00048a7c8 sp=0xc00048a738 pc=0x5d46b0c6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00048a7e0 sp=0xc00048a7c8 pc=0x5d46b0c6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00048a7e8 sp=0xc00048a7e0 pc=0x5d46b0cc6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 5 gp=0xc000007880 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000086738 sp=0xc000086718 pc=0x5d46b0cbdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
	runtime/mgc.go:1412 +0xe9 fp=0xc0000867c8 sp=0xc000086738 pc=0x5d46b0c6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0000867e0 sp=0xc0000867c8 pc=0x5d46b0c6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x5d46b0cc6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 21 gp=0xc000470380 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000081738 sp=0xc000081718 pc=0x5d46b0cbdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
	runtime/mgc.go:1412 +0xe9 fp=0xc0000817c8 sp=0xc000081738 pc=0x5d46b0c6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0000817e0 sp=0xc0000817c8 pc=0x5d46b0c6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000817e8 sp=0xc0000817e0 pc=0x5d46b0cc6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 22 gp=0xc000470540 m=nil [GC worker (idle)]:
runtime.gopark(0x1db5e35f8f?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000081f38 sp=0xc000081f18 pc=0x5d46b0cbdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
	runtime/mgc.go:1412 +0xe9 fp=0xc000081fc8 sp=0xc000081f38 pc=0x5d46b0c6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000081fe0 sp=0xc000081fc8 pc=0x5d46b0c6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000081fe8 sp=0xc000081fe0 pc=0x5d46b0cc6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 6 gp=0xc000007a40 m=nil [GC worker (idle)]:
runtime.gopark(0x1db5e5291d?, 0x3?, 0x2a?, 0x5f?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000086f38 sp=0xc000086f18 pc=0x5d46b0cbdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
	runtime/mgc.go:1412 +0xe9 fp=0xc000086fc8 sp=0xc000086f38 pc=0x5d46b0c6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000086fe0 sp=0xc000086fc8 pc=0x5d46b0c6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x5d46b0cc6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 23 gp=0xc000470700 m=nil [GC worker (idle)]:
runtime.gopark(0x1db5e5210d?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000082738 sp=0xc000082718 pc=0x5d46b0cbdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
	runtime/mgc.go:1412 +0xe9 fp=0xc0000827c8 sp=0xc000082738 pc=0x5d46b0c6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0000827e0 sp=0xc0000827c8 pc=0x5d46b0c6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000827e8 sp=0xc0000827e0 pc=0x5d46b0cc6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 35 gp=0xc0004841c0 m=nil [GC worker (idle)]:
runtime.gopark(0x1db5e3472e?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00048af38 sp=0xc00048af18 pc=0x5d46b0cbdc4e
runtime.gcBgMarkWorker(0xc0001136c0)
	runtime/mgc.go:1412 +0xe9 fp=0xc00048afc8 sp=0xc00048af38 pc=0x5d46b0c6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00048afe0 sp=0xc00048afc8 pc=0x5d46b0c6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00048afe8 sp=0xc00048afe0 pc=0x5d46b0cc6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 42 gp=0xc000484a80 m=nil [IO wait]:
runtime.gopark(0x5d46b0c62965?, 0x0?, 0x0?, 0x0?, 0xb?)
	runtime/proc.go:424 +0xce fp=0xc0004885a8 sp=0xc000488588 pc=0x5d46b0cbdc4e
runtime.netpollblock(0x5d46b0ce0e78?, 0xb0c54a66?, 0x46?)
	runtime/netpoll.go:575 +0xf7 fp=0xc0004885e0 sp=0xc0004885a8 pc=0x5d46b0c818b7
internal/poll.runtime_pollWait(0x76bc60dc8cd8, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc000488600 sp=0xc0004885e0 pc=0x5d46b0cbcf45
internal/poll.(*pollDesc).wait(0xc0001ac000?, 0xc00011c6d1?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000488628 sp=0xc000488600 pc=0x5d46b0d44567
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001ac000, {0xc00011c6d1, 0x1, 0x1})
	internal/poll/fd_unix.go:165 +0x27a fp=0xc0004886c0 sp=0xc000488628 pc=0x5d46b0d4585a
net.(*netFD).Read(0xc0001ac000, {0xc00011c6d1?, 0xc000488748?, 0x5d46b0cbf8d0?})
	net/fd_posix.go:55 +0x25 fp=0xc000488708 sp=0xc0004886c0 pc=0x5d46b0db0045
net.(*conn).Read(0xc00012a040, {0xc00011c6d1?, 0x0?, 0x5d46b25c6680?})
	net/net.go:189 +0x45 fp=0xc000488750 sp=0xc000488708 pc=0x5d46b0dbe645
net.(*TCPConn).Read(0x5d46b25030a0?, {0xc00011c6d1?, 0x0?, 0x0?})
	<autogenerated>:1 +0x25 fp=0xc000488780 sp=0xc000488750 pc=0x5d46b0dd1845
net/http.(*connReader).backgroundRead(0xc00011c6c0)
	net/http/server.go:690 +0x37 fp=0xc0004887c8 sp=0xc000488780 pc=0x5d46b100db37
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:686 +0x25 fp=0xc0004887e0 sp=0xc0004887c8 pc=0x5d46b100da65
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0004887e8 sp=0xc0004887e0 pc=0x5d46b0cc6021
created by net/http.(*connReader).startBackgroundRead in goroutine 36
	net/http/server.go:686 +0xb6

goroutine 36 gp=0xc0004708c0 m=nil [select]:
runtime.gopark(0xc00005ba68?, 0x2?, 0xce?, 0x36?, 0xc00005b834?)
	runtime/proc.go:424 +0xce fp=0xc00005b650 sp=0xc00005b630 pc=0x5d46b0cbdc4e
runtime.selectgo(0xc00005ba68, 0xc00005b830, 0x21?, 0x0, 0x1?, 0x1)
	runtime/select.go:335 +0x7a5 fp=0xc00005b778 sp=0xc00005b650 pc=0x5d46b0c9af45
ollama/llama/runner.(*Server).completion(0xc00019d560, {0x5d46b1e06910, 0xc0001a29a0}, 0xc000422640)
	ollama/llama/runner/runner.go:696 +0xab6 fp=0xc00005bac0 sp=0xc00005b778 pc=0x5d46b1089236
ollama/llama/runner.(*Server).completion-fm({0x5d46b1e06910?, 0xc0001a29a0?}, 0x5d46b1021fe7?)
	<autogenerated>:1 +0x36 fp=0xc00005baf0 sp=0xc00005bac0 pc=0x5d46b108c916
net/http.HandlerFunc.ServeHTTP(0xc0004bbc00?, {0x5d46b1e06910?, 0xc0001a29a0?}, 0x0?)
	net/http/server.go:2220 +0x29 fp=0xc00005bb18 sp=0xc00005baf0 pc=0x5d46b1014809
net/http.(*ServeMux).ServeHTTP(0x5d46b0c5e485?, {0x5d46b1e06910, 0xc0001a29a0}, 0xc000422640)
	net/http/server.go:2747 +0x1ca fp=0xc00005bb68 sp=0xc00005bb18 pc=0x5d46b101670a
net/http.serverHandler.ServeHTTP({0x5d46b1e03510?}, {0x5d46b1e06910?, 0xc0001a29a0?}, 0x6?)
	net/http/server.go:3210 +0x8e fp=0xc00005bb98 sp=0xc00005bb68 pc=0x5d46b1033c6e
net/http.(*conn).serve(0xc0004e4000, {0x5d46b1e08988, 0xc00060ed20})
	net/http/server.go:2092 +0x5d0 fp=0xc00005bfb8 sp=0xc00005bb98 pc=0x5d46b10131b0
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3360 +0x28 fp=0xc00005bfe0 sp=0xc00005bfb8 pc=0x5d46b1018608
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x5d46b0cc6021
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3360 +0x485

rax    0x0
rbx    0x0
rcx    0x76bbf0006940
rdx    0x76bbffdfebf0
rdi    0x6
rsi    0x5d46df7625a0
rbp    0x76bbffdfe960
rsp    0x76bbffdfe3f0
r8     0x76bbf0006ac0
r9     0x0
r10    0x76bc613fcf28
r11    0x76bbf0006ac0
r12    0x76bbf0006ac0
r13    0x76bbec0cd298
r14    0x76bbf0006940
r15    0x76bbf0006ac0
rip    0x76bc5f60bc2f
rflags 0x10202
cs     0x33
fs     0x0
gs     0x0
[GIN] 2025/02/17 - 23:54:59 | 500 |  2.583875751s |       10.89.0.4 | POST     "/api/generate"
time=2025-02-18T00:05:40.353+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="29.0 GiB" free_swap="8.0 GiB"
time=2025-02-18T00:05:40.353+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[29.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB"
time=2025-02-18T00:05:40.353+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 43439"
time=2025-02-18T00:05:40.354+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-18T00:05:40.354+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-02-18T00:05:40.354+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
time=2025-02-18T00:05:40.488+08:00 level=INFO source=runner.go:967 msg="starting go runner"
time=2025-02-18T00:05:40.488+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
time=2025-02-18T00:05:40.488+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:43439"
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2025-02-18T00:05:40.605+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        SYCL0 model buffer size =   852.89 MiB
llm_load_tensors:        SYCL1 model buffer size =  1065.46 MiB
llm_load_tensors:          CPU model buffer size =   308.23 MiB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
| 1| [level_zero:gpu:1]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
llama_kv_cache_init:      SYCL0 KV buffer size =   960.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      SYCL0 compute buffer size =   202.01 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =   408.52 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =   134.02 MiB
llama_new_context_with_model: graph nodes  = 790
llama_new_context_with_model: graph splits = 3
time=2025-02-18T00:05:42.945+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-02-18T00:05:43.114+08:00 level=INFO source=server.go:610 msg="llama runner started in 2.76 seconds"
SIGILL: illegal instruction
PC=0x7f465740bc2f m=3 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xf3 0xf 0xc7 0xf8 0x25 0xff 0x3 0x0 0x0 0x48 0x8b 0xd 0xe1 0xc2 0x2a 0x0

goroutine 7 gp=0xc000584e00 m=3 mp=0xc00008ae08 [syscall]:
runtime.cgocall(0x6144ee0584e0, 0xc000096b90)
	runtime/cgocall.go:167 +0x4b fp=0xc000096b68 sp=0xc000096b30 pc=0x6144ed4b754b
ollama/llama/llamafile._Cfunc_llama_decode(0x7f45d3c4a750, {0x21, 0x7f45d001ba90, 0x0, 0x0, 0x7f45d001c2a0, 0x7f45d001cab0, 0x7f45d001d2c0, 0x7f45d0081840})
	_cgo_gotypes.go:558 +0x4f fp=0xc000096b90 sp=0xc000096b68 pc=0x6144ed87996f
ollama/llama/llamafile.(*Context).Decode.func1(0x6144ed8886eb?, 0x7f45d3c4a750?)
	ollama/llama/llamafile/llama.go:143 +0xf5 fp=0xc000096c80 sp=0xc000096b90 pc=0x6144ed87c595
ollama/llama/llamafile.(*Context).Decode(0xc000086d70?, 0x0?)
	ollama/llama/llamafile/llama.go:143 +0x13 fp=0xc000096cc8 sp=0xc000096c80 pc=0x6144ed87c413
ollama/llama/runner.(*Server).processBatch(0xc00019d5f0, 0xc0000b4a20, 0xc000086f20)
	ollama/llama/runner/runner.go:434 +0x23f fp=0xc000096ee0 sp=0xc000096cc8 pc=0x6144ed8873bf
ollama/llama/runner.(*Server).run(0xc00019d5f0, {0x6144ee6089c0, 0xc00016f310})
	ollama/llama/runner/runner.go:342 +0x1d5 fp=0xc000096fb8 sp=0xc000096ee0 pc=0x6144ed886df5
ollama/llama/runner.Execute.gowrap2()
	ollama/llama/runner/runner.go:1006 +0x28 fp=0xc000096fe0 sp=0xc000096fb8 pc=0x6144ed88c068
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000096fe8 sp=0xc000096fe0 pc=0x6144ed4c6021
created by ollama/llama/runner.Execute in goroutine 1
	ollama/llama/runner/runner.go:1006 +0xde5

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000507560 sp=0xc000507540 pc=0x6144ed4bdc4e
runtime.netpollblock(0x20059ff80?, 0xed454a66?, 0x44?)
	runtime/netpoll.go:575 +0xf7 fp=0xc000507598 sp=0xc000507560 pc=0x6144ed4818b7
internal/poll.runtime_pollWait(0x7f4658f67df0, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc0005075b8 sp=0xc000507598 pc=0x6144ed4bcf45
internal/poll.(*pollDesc).wait(0xc00048e880?, 0x2c?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0005075e0 sp=0xc0005075b8 pc=0x6144ed544567
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc00048e880)
	internal/poll/fd_unix.go:620 +0x295 fp=0xc000507688 sp=0xc0005075e0 pc=0x6144ed549935
net.(*netFD).accept(0xc00048e880)
	net/fd_unix.go:172 +0x29 fp=0xc000507740 sp=0xc000507688 pc=0x6144ed5b2009
net.(*TCPListener).accept(0xc000062f80)
	net/tcpsock_posix.go:159 +0x1e fp=0xc000507790 sp=0xc000507740 pc=0x6144ed5c7c7e
net.(*TCPListener).Accept(0xc000062f80)
	net/tcpsock.go:372 +0x30 fp=0xc0005077c0 sp=0xc000507790 pc=0x6144ed5c6b30
net/http.(*onceCloseListener).Accept(0xc00059e000?)
	<autogenerated>:1 +0x24 fp=0xc0005077d8 sp=0xc0005077c0 pc=0x6144ed840284
net/http.(*Server).Serve(0xc0000eb2c0, {0x6144ee606700, 0xc000062f80})
	net/http/server.go:3330 +0x30c fp=0xc000507908 sp=0xc0005077d8 pc=0x6144ed81820c
ollama/llama/runner.Execute({0xc000136010?, 0x0?, 0x0?})
	ollama/llama/runner/runner.go:1027 +0x11a9 fp=0xc000507ca8 sp=0xc000507908 pc=0x6144ed88bd49
ollama/cmd.NewCLI.func2(0xc0001c9400?, {0x6144ee05cf9d?, 0x4?, 0x6144ee05cfa1?})
	ollama/cmd/cmd.go:1430 +0x45 fp=0xc000507cd0 sp=0xc000507ca8 pc=0x6144ee057765
github.com/spf13/cobra.(*Command).execute(0xc00069a908, {0xc0000eac30, 0xf, 0xf})
	github.com/spf13/cobra@v1.8.1/command.go:985 +0xaaa fp=0xc000507e58 sp=0xc000507cd0 pc=0x6144ed64b3ea
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001f9b08)
	github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff fp=0xc000507f30 sp=0xc000507e58 pc=0x6144ed64bcbf
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.8.1/command.go:1041
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/cobra@v1.8.1/command.go:1034
main.main()
	ollama/main.go:12 +0x4d fp=0xc000507f50 sp=0xc000507f30 pc=0x6144ee057dcd
runtime.main()
	runtime/proc.go:272 +0x29d fp=0xc000507fe0 sp=0xc000507f50 pc=0x6144ed488f5d
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000507fe8 sp=0xc000507fe0 pc=0x6144ed4c6021

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x6144ed4bdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.forcegchelper()
	runtime/proc.go:337 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x6144ed489298
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x6144ed4c6021
created by runtime.init.7 in goroutine 1
	runtime/proc.go:325 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x6144ed4bdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.bgsweep(0xc0000b2000)
	runtime/mgcsweep.go:317 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x6144ed47393f
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x6144ed467f85
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x6144ed4c6021
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x6144ee203070?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x6144ed4bdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.(*scavengerState).park(0x6144eed9ed80)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x6144ed471309
runtime.bgscavenge(0xc0000b2000)
	runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x6144ed471899
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x6144ed467f25
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x6144ed4c6021
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 18 gp=0xc000104700 m=nil [finalizer wait]:
runtime.gopark(0xc000084648?, 0x6144ed45e485?, 0xb0?, 0x1?, 0xc0000061c0?)
	runtime/proc.go:424 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x6144ed4bdc4e
runtime.runfinq()
	runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x6144ed467007
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x6144ed4c6021
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:163 +0x3d

goroutine 19 gp=0xc000105880 m=nil [chan receive]:
runtime.gopark(0xc000080760?, 0x6144ed599685?, 0x40?, 0xc8?, 0x6144ee619c00?)
	runtime/proc.go:424 +0xce fp=0xc000080718 sp=0xc0000806f8 pc=0x6144ed4bdc4e
runtime.chanrecv(0xc0001122a0, 0x0, 0x1)
	runtime/chan.go:639 +0x41c fp=0xc000080790 sp=0xc000080718 pc=0x6144ed45767c
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:489 +0x12 fp=0xc0000807b8 sp=0xc000080790 pc=0x6144ed457232
runtime.unique_runtime_registerUniqueMapCleanup.func1(...)
	runtime/mgc.go:1781
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1784 +0x2f fp=0xc0000807e0 sp=0xc0000807b8 pc=0x6144ed46afef
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x6144ed4c6021
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1779 +0x96

goroutine 20 gp=0xc000105c00 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000080f38 sp=0xc000080f18 pc=0x6144ed4bdc4e
runtime.gcBgMarkWorker(0xc000113880)
	runtime/mgc.go:1412 +0xe9 fp=0xc000080fc8 sp=0xc000080f38 pc=0x6144ed46a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000080fe0 sp=0xc000080fc8 pc=0x6144ed46a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x6144ed4c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 34 gp=0xc000484000 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00048a738 sp=0xc00048a718 pc=0x6144ed4bdc4e
runtime.gcBgMarkWorker(0xc000113880)
	runtime/mgc.go:1412 +0xe9 fp=0xc00048a7c8 sp=0xc00048a738 pc=0x6144ed46a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00048a7e0 sp=0xc00048a7c8 pc=0x6144ed46a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00048a7e8 sp=0xc00048a7e0 pc=0x6144ed4c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 5 gp=0xc000007880 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000086738 sp=0xc000086718 pc=0x6144ed4bdc4e
runtime.gcBgMarkWorker(0xc000113880)
	runtime/mgc.go:1412 +0xe9 fp=0xc0000867c8 sp=0xc000086738 pc=0x6144ed46a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0000867e0 sp=0xc0000867c8 pc=0x6144ed46a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x6144ed4c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 35 gp=0xc0004841c0 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00048af38 sp=0xc00048af18 pc=0x6144ed4bdc4e
runtime.gcBgMarkWorker(0xc000113880)
	runtime/mgc.go:1412 +0xe9 fp=0xc00048afc8 sp=0xc00048af38 pc=0x6144ed46a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00048afe0 sp=0xc00048afc8 pc=0x6144ed46a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00048afe8 sp=0xc00048afe0 pc=0x6144ed4c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 36 gp=0xc000484380 m=nil [GC worker (idle)]:
runtime.gopark(0xb375489572?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00048b738 sp=0xc00048b718 pc=0x6144ed4bdc4e
runtime.gcBgMarkWorker(0xc000113880)
	runtime/mgc.go:1412 +0xe9 fp=0xc00048b7c8 sp=0xc00048b738 pc=0x6144ed46a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00048b7e0 sp=0xc00048b7c8 pc=0x6144ed46a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00048b7e8 sp=0xc00048b7e0 pc=0x6144ed4c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 37 gp=0xc000484540 m=nil [GC worker (idle)]:
runtime.gopark(0xb375487a7e?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00048bf38 sp=0xc00048bf18 pc=0x6144ed4bdc4e
runtime.gcBgMarkWorker(0xc000113880)
	runtime/mgc.go:1412 +0xe9 fp=0xc00048bfc8 sp=0xc00048bf38 pc=0x6144ed46a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00048bfe0 sp=0xc00048bfc8 pc=0x6144ed46a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00048bfe8 sp=0xc00048bfe0 pc=0x6144ed4c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 21 gp=0xc000105dc0 m=nil [GC worker (idle)]:
runtime.gopark(0xb375492c10?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000081738 sp=0xc000081718 pc=0x6144ed4bdc4e
runtime.gcBgMarkWorker(0xc000113880)
	runtime/mgc.go:1412 +0xe9 fp=0xc0000817c8 sp=0xc000081738 pc=0x6144ed46a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0000817e0 sp=0xc0000817c8 pc=0x6144ed46a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000817e8 sp=0xc0000817e0 pc=0x6144ed4c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 22 gp=0xc00047a000 m=nil [GC worker (idle)]:
runtime.gopark(0xb375488bb1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000081f38 sp=0xc000081f18 pc=0x6144ed4bdc4e
runtime.gcBgMarkWorker(0xc000113880)
	runtime/mgc.go:1412 +0xe9 fp=0xc000081fc8 sp=0xc000081f38 pc=0x6144ed46a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000081fe0 sp=0xc000081fc8 pc=0x6144ed46a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000081fe8 sp=0xc000081fe0 pc=0x6144ed4c6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 11 gp=0xc000584c40 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
	runtime/proc.go:424 +0xce fp=0xc0000825a8 sp=0xc000082588 pc=0x6144ed4bdc4e
runtime.netpollblock(0x6144ed4e0e78?, 0xed454a66?, 0x44?)
	runtime/netpoll.go:575 +0xf7 fp=0xc0000825e0 sp=0xc0000825a8 pc=0x6144ed4818b7
internal/poll.runtime_pollWait(0x7f4658f67cd8, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc000082600 sp=0xc0000825e0 pc=0x6144ed4bcf45
internal/poll.(*pollDesc).wait(0xc0001ac000?, 0xc000592101?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000082628 sp=0xc000082600 pc=0x6144ed544567
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001ac000, {0xc000592101, 0x1, 0x1})
	internal/poll/fd_unix.go:165 +0x27a fp=0xc0000826c0 sp=0xc000082628 pc=0x6144ed54585a
net.(*netFD).Read(0xc0001ac000, {0xc000592101?, 0x0?, 0x0?})
	net/fd_posix.go:55 +0x25 fp=0xc000082708 sp=0xc0000826c0 pc=0x6144ed5b0045
net.(*conn).Read(0xc000088128, {0xc000592101?, 0x0?, 0x0?})
	net/net.go:189 +0x45 fp=0xc000082750 sp=0xc000082708 pc=0x6144ed5be645
net.(*TCPConn).Read(0x0?, {0xc000592101?, 0x0?, 0x0?})
	<autogenerated>:1 +0x25 fp=0xc000082780 sp=0xc000082750 pc=0x6144ed5d1845
net/http.(*connReader).backgroundRead(0xc0005920f0)
	net/http/server.go:690 +0x37 fp=0xc0000827c8 sp=0xc000082780 pc=0x6144ed80db37
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:686 +0x25 fp=0xc0000827e0 sp=0xc0000827c8 pc=0x6144ed80da65
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000827e8 sp=0xc0000827e0 pc=0x6144ed4c6021
created by net/http.(*connReader).startBackgroundRead in goroutine 50
	net/http/server.go:686 +0xb6

goroutine 50 gp=0xc000007dc0 m=nil [select]:
runtime.gopark(0xc00005ba68?, 0x2?, 0xce?, 0x36?, 0xc00005b834?)
	runtime/proc.go:424 +0xce fp=0xc00005b650 sp=0xc00005b630 pc=0x6144ed4bdc4e
runtime.selectgo(0xc00005ba68, 0xc00005b830, 0x21?, 0x0, 0x1?, 0x1)
	runtime/select.go:335 +0x7a5 fp=0xc00005b778 sp=0xc00005b650 pc=0x6144ed49af45
ollama/llama/runner.(*Server).completion(0xc00019d5f0, {0x6144ee606910, 0xc0001a2c40}, 0xc000423e00)
	ollama/llama/runner/runner.go:696 +0xab6 fp=0xc00005bac0 sp=0xc00005b778 pc=0x6144ed889236
ollama/llama/runner.(*Server).completion-fm({0x6144ee606910?, 0xc0001a2c40?}, 0x6144ed821fe7?)
	<autogenerated>:1 +0x36 fp=0xc00005baf0 sp=0xc00005bac0 pc=0x6144ed88c916
net/http.HandlerFunc.ServeHTTP(0xc0001a28c0?, {0x6144ee606910?, 0xc0001a2c40?}, 0x0?)
	net/http/server.go:2220 +0x29 fp=0xc00005bb18 sp=0xc00005baf0 pc=0x6144ed814809
net/http.(*ServeMux).ServeHTTP(0x6144ed45e485?, {0x6144ee606910, 0xc0001a2c40}, 0xc000423e00)
	net/http/server.go:2747 +0x1ca fp=0xc00005bb68 sp=0xc00005bb18 pc=0x6144ed81670a
net/http.serverHandler.ServeHTTP({0x6144ee603510?}, {0x6144ee606910?, 0xc0001a2c40?}, 0x6?)
	net/http/server.go:3210 +0x8e fp=0xc00005bb98 sp=0xc00005bb68 pc=0x6144ed833c6e
net/http.(*conn).serve(0xc00059e000, {0x6144ee608988, 0xc000698600})
	net/http/server.go:2092 +0x5d0 fp=0xc00005bfb8 sp=0xc00005bb98 pc=0x6144ed8131b0
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3360 +0x28 fp=0xc00005bfe0 sp=0xc00005bfb8 pc=0x6144ed818608
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x6144ed4c6021
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3360 +0x485

rax    0x0
rbx    0x0
rcx    0x7f45e8006740
rdx    0x7f45f7bfebf0
rdi    0x6
rsi    0x6144ff3565a0
rbp    0x7f45f7bfe960
rsp    0x7f45f7bfe3f0
r8     0x7f45e80068c0
r9     0x0
r10    0x7f4659563f28
r11    0x7f45e80068c0
r12    0x7f45e80068c0
r13    0x7f45d00cd478
r14    0x7f45e8006740
r15    0x7f45e80068c0
rip    0x7f465740bc2f
rflags 0x10202
cs     0x33
fs     0x0
gs     0x0
[GIN] 2025/02/18 - 00:05:43 | 500 |   2.83359842s |       10.89.0.4 | POST     "/api/generate"
[GIN] 2025/02/18 - 00:06:38 | 200 |          26µs |       10.89.0.4 | GET      "/"
[GIN] 2025/02/18 - 00:07:05 | 200 |      17.285µs |       10.89.0.4 | GET      "/"
[GIN] 2025/02/18 - 00:07:20 | 404 |       2.766µs |       10.89.0.4 | POST     "/api/cat"
[GIN] 2025/02/18 - 00:07:33 | 404 |      17.543µs |       10.89.0.4 | POST     "/api/cat"
time=2025-02-18T00:07:46.069+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="29.0 GiB" free_swap="8.0 GiB"
time=2025-02-18T00:07:46.069+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[29.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB"
time=2025-02-18T00:07:46.070+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 38703"
time=2025-02-18T00:07:46.070+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-18T00:07:46.070+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-02-18T00:07:46.070+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
time=2025-02-18T00:07:46.205+08:00 level=INFO source=runner.go:967 msg="starting go runner"
time=2025-02-18T00:07:46.205+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
time=2025-02-18T00:07:46.205+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:38703"
llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2025-02-18T00:07:46.321+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        SYCL0 model buffer size =   852.89 MiB
llm_load_tensors:        SYCL1 model buffer size =  1065.46 MiB
llm_load_tensors:          CPU model buffer size =   308.23 MiB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
| 1| [level_zero:gpu:1]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
llama_kv_cache_init:      SYCL0 KV buffer size =   960.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      SYCL0 compute buffer size =   202.01 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =   408.52 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =   134.02 MiB
llama_new_context_with_model: graph nodes  = 790
llama_new_context_with_model: graph splits = 3
time=2025-02-18T00:07:48.472+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-02-18T00:07:48.579+08:00 level=INFO source=server.go:610 msg="llama runner started in 2.51 seconds"
[GIN] 2025/02/18 - 00:07:48 | 200 |  2.551880862s |       10.89.0.4 | POST     "/api/chat"
[GIN] 2025/02/18 - 00:07:59 | 200 |    14.32507ms |       10.89.0.4 | POST     "/api/chat"
[GIN] 2025/02/18 - 00:08:03 | 200 |   14.761876ms |       10.89.0.4 | POST     "/api/chat"
[GIN] 2025/02/18 - 00:09:21 | 200 |     391.063µs |       10.89.0.1 | GET      "/api/tags"
[GIN] 2025/02/18 - 00:09:29 | 200 |      86.722µs |       10.89.0.1 | GET      "/api/version"
[GIN] 2025/02/18 - 00:09:45 | 200 |     220.019µs |       10.89.0.4 | GET      "/api/tags"
[GIN] 2025/02/18 - 00:10:24 | 200 |   15.094868ms |       10.89.0.4 | POST     "/api/chat"
time=2025-02-18T00:51:08.195+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="29.0 GiB" free_swap="8.0 GiB"
time=2025-02-18T00:51:08.195+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[29.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB"
time=2025-02-18T00:51:08.196+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 39041"
time=2025-02-18T00:51:08.196+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-18T00:51:08.196+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-02-18T00:51:08.196+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
time=2025-02-18T00:51:08.334+08:00 level=INFO source=runner.go:967 msg="starting go runner"
time=2025-02-18T00:51:08.334+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
time=2025-02-18T00:51:08.334+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:39041"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2025-02-18T00:51:08.447+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        SYCL0 model buffer size =   852.89 MiB
llm_load_tensors:        SYCL1 model buffer size =  1065.46 MiB
llm_load_tensors:          CPU model buffer size =   308.23 MiB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
| 1| [level_zero:gpu:1]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
llama_kv_cache_init:      SYCL0 KV buffer size =   960.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      SYCL0 compute buffer size =   202.01 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =   408.52 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =   134.02 MiB
llama_new_context_with_model: graph nodes  = 790
llama_new_context_with_model: graph splits = 3
time=2025-02-18T00:51:10.610+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-02-18T00:51:10.705+08:00 level=INFO source=server.go:610 msg="llama runner started in 2.51 seconds"
[GIN] 2025/02/18 - 00:51:10 | 200 |  2.550868677s |       10.89.0.4 | POST     "/api/chat"
SIGILL: illegal instruction
PC=0x725dfd20bc2f m=9 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xf3 0xf 0xc7 0xf8 0x25 0xff 0x3 0x0 0x0 0x48 0x8b 0xd 0xe1 0xc2 0x2a 0x0

goroutine 14 gp=0xc0001048c0 m=9 mp=0xc000508008 [syscall]:
runtime.cgocall(0x616d12a584e0, 0xc0004f7b90)
	runtime/cgocall.go:167 +0x4b fp=0xc0004f7b68 sp=0xc0004f7b30 pc=0x616d11eb754b
ollama/llama/llamafile._Cfunc_llama_decode(0x725d73c576b0, {0x21, 0x725d70020350, 0x0, 0x0, 0x725d7001b8f0, 0x725d7001c100, 0x725d7001c910, 0x725d700813f0})
	_cgo_gotypes.go:558 +0x4f fp=0xc0004f7b90 sp=0xc0004f7b68 pc=0x616d1227996f
ollama/llama/llamafile.(*Context).Decode.func1(0x616d122886eb?, 0x725d73c576b0?)
	ollama/llama/llamafile/llama.go:143 +0xf5 fp=0xc0004f7c80 sp=0xc0004f7b90 pc=0x616d1227c595
ollama/llama/llamafile.(*Context).Decode(0xc0004f7d70?, 0x0?)
	ollama/llama/llamafile/llama.go:143 +0x13 fp=0xc0004f7cc8 sp=0xc0004f7c80 pc=0x616d1227c413
ollama/llama/runner.(*Server).processBatch(0xc0001a7560, 0xc00062c0c0, 0xc0004f7f20)
	ollama/llama/runner/runner.go:434 +0x23f fp=0xc0004f7ee0 sp=0xc0004f7cc8 pc=0x616d122873bf
ollama/llama/runner.(*Server).run(0xc0001a7560, {0x616d130089c0, 0xc000526e10})
	ollama/llama/runner/runner.go:342 +0x1d5 fp=0xc0004f7fb8 sp=0xc0004f7ee0 pc=0x616d12286df5
ollama/llama/runner.Execute.gowrap2()
	ollama/llama/runner/runner.go:1006 +0x28 fp=0xc0004f7fe0 sp=0xc0004f7fb8 pc=0x616d1228c068
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0004f7fe8 sp=0xc0004f7fe0 pc=0x616d11ec6021
created by ollama/llama/runner.Execute in goroutine 1
	ollama/llama/runner/runner.go:1006 +0xde5

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc0004b5560 sp=0xc0004b5540 pc=0x616d11ebdc4e
runtime.netpollblock(0x1ff80?, 0x11e54a66?, 0x6d?)
	runtime/netpoll.go:575 +0xf7 fp=0xc0004b5598 sp=0xc0004b5560 pc=0x616d11e818b7
internal/poll.runtime_pollWait(0x725dfed35680, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc0004b55b8 sp=0xc0004b5598 pc=0x616d11ebcf45
internal/poll.(*pollDesc).wait(0xc000507900?, 0x2c?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0004b55e0 sp=0xc0004b55b8 pc=0x616d11f44567
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000507900)
	internal/poll/fd_unix.go:620 +0x295 fp=0xc0004b5688 sp=0xc0004b55e0 pc=0x616d11f49935
net.(*netFD).accept(0xc000507900)
	net/fd_unix.go:172 +0x29 fp=0xc0004b5740 sp=0xc0004b5688 pc=0x616d11fb2009
net.(*TCPListener).accept(0xc0005236c0)
	net/tcpsock_posix.go:159 +0x1e fp=0xc0004b5790 sp=0xc0004b5740 pc=0x616d11fc7c7e
net.(*TCPListener).Accept(0xc0005236c0)
	net/tcpsock.go:372 +0x30 fp=0xc0004b57c0 sp=0xc0004b5790 pc=0x616d11fc6b30
net/http.(*onceCloseListener).Accept(0xc00001e000?)
	<autogenerated>:1 +0x24 fp=0xc0004b57d8 sp=0xc0004b57c0 pc=0x616d12240284
net/http.(*Server).Serve(0xc0003ecc30, {0x616d13006700, 0xc0005236c0})
	net/http/server.go:3330 +0x30c fp=0xc0004b5908 sp=0xc0004b57d8 pc=0x616d1221820c
ollama/llama/runner.Execute({0xc000036130?, 0x0?, 0x0?})
	ollama/llama/runner/runner.go:1027 +0x11a9 fp=0xc0004b5ca8 sp=0xc0004b5908 pc=0x616d1228bd49
ollama/cmd.NewCLI.func2(0xc0001cf200?, {0x616d12a5cf9d?, 0x4?, 0x616d12a5cfa1?})
	ollama/cmd/cmd.go:1430 +0x45 fp=0xc0004b5cd0 sp=0xc0004b5ca8 pc=0x616d12a57765
github.com/spf13/cobra.(*Command).execute(0xc0004ce908, {0xc0003ec2d0, 0xf, 0xf})
	github.com/spf13/cobra@v1.8.1/command.go:985 +0xaaa fp=0xc0004b5e58 sp=0xc0004b5cd0 pc=0x616d1204b3ea
github.com/spf13/cobra.(*Command).ExecuteC(0xc0006b2f08)
	github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff fp=0xc0004b5f30 sp=0xc0004b5e58 pc=0x616d1204bcbf
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.8.1/command.go:1041
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/cobra@v1.8.1/command.go:1034
main.main()
	ollama/main.go:12 +0x4d fp=0xc0004b5f50 sp=0xc0004b5f30 pc=0x616d12a57dcd
runtime.main()
	runtime/proc.go:272 +0x29d fp=0xc0004b5fe0 sp=0xc0004b5f50 pc=0x616d11e88f5d
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0004b5fe8 sp=0xc0004b5fe0 pc=0x616d11ec6021

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x616d11ebdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.forcegchelper()
	runtime/proc.go:337 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x616d11e89298
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x616d11ec6021
created by runtime.init.7 in goroutine 1
	runtime/proc.go:325 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x616d11ebdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.bgsweep(0xc0000b2000)
	runtime/mgcsweep.go:317 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x616d11e7393f
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x616d11e67f85
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x616d11ec6021
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x616d12c03070?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x616d11ebdc4e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.(*scavengerState).park(0x616d1379ed80)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x616d11e71309
runtime.bgscavenge(0xc0000b2000)
	runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x616d11e71899
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x616d11e67f25
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x616d11ec6021
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
runtime.gopark(0xc000084648?, 0x616d11e5e485?, 0xb0?, 0x1?, 0xc0000061c0?)
	runtime/proc.go:424 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x616d11ebdc4e
runtime.runfinq()
	runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x616d11e67007
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x616d11ec6021
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:163 +0x3d

goroutine 6 gp=0xc0001fae00 m=nil [chan receive]:
runtime.gopark(0xc000086760?, 0x616d11f99685?, 0x40?, 0xe8?, 0x616d13019c00?)
	runtime/proc.go:424 +0xce fp=0xc000086718 sp=0xc0000866f8 pc=0x616d11ebdc4e
runtime.chanrecv(0xc00004e310, 0x0, 0x1)
	runtime/chan.go:639 +0x41c fp=0xc000086790 sp=0xc000086718 pc=0x616d11e5767c
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:489 +0x12 fp=0xc0000867b8 sp=0xc000086790 pc=0x616d11e57232
runtime.unique_runtime_registerUniqueMapCleanup.func1(...)
	runtime/mgc.go:1781
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1784 +0x2f fp=0xc0000867e0 sp=0xc0000867b8 pc=0x616d11e6afef
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x616d11ec6021
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1779 +0x96

goroutine 7 gp=0xc0001fba40 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000086f38 sp=0xc000086f18 pc=0x616d11ebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc000086fc8 sp=0xc000086f38 pc=0x616d11e6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000086fe0 sp=0xc000086fc8 pc=0x616d11e6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x616d11ec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 18 gp=0xc000504000 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000080738 sp=0xc000080718 pc=0x616d11ebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc0000807c8 sp=0xc000080738 pc=0x616d11e6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0000807e0 sp=0xc0000807c8 pc=0x616d11e6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x616d11ec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 34 gp=0xc000104380 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc00011c738 sp=0xc00011c718 pc=0x616d11ebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc00011c7c8 sp=0xc00011c738 pc=0x616d11e6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc00011c7e0 sp=0xc00011c7c8 pc=0x616d11e6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00011c7e8 sp=0xc00011c7e0 pc=0x616d11ec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 8 gp=0xc0001fbc00 m=nil [GC worker (idle)]:
runtime.gopark(0x32e956a4c73?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000087738 sp=0xc000087718 pc=0x616d11ebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc0000877c8 sp=0xc000087738 pc=0x616d11e6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0000877e0 sp=0xc0000877c8 pc=0x616d11e6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000877e8 sp=0xc0000877e0 pc=0x616d11ec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 9 gp=0xc0001fbdc0 m=nil [GC worker (idle)]:
runtime.gopark(0x32e9569ff0f?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000087f38 sp=0xc000087f18 pc=0x616d11ebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc000087fc8 sp=0xc000087f38 pc=0x616d11e6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000087fe0 sp=0xc000087fc8 pc=0x616d11e6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000087fe8 sp=0xc000087fe0 pc=0x616d11ec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 10 gp=0xc0004a4000 m=nil [GC worker (idle)]:
runtime.gopark(0x616d137c8900?, 0x1?, 0xa3?, 0xe?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000118738 sp=0xc000118718 pc=0x616d11ebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc0001187c8 sp=0xc000118738 pc=0x616d11e6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0001187e0 sp=0xc0001187c8 pc=0x616d11e6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0001187e8 sp=0xc0001187e0 pc=0x616d11ec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 11 gp=0xc0004a41c0 m=nil [GC worker (idle)]:
runtime.gopark(0x32e9569fe95?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000118f38 sp=0xc000118f18 pc=0x616d11ebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc000118fc8 sp=0xc000118f38 pc=0x616d11e6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc000118fe0 sp=0xc000118fc8 pc=0x616d11e6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000118fe8 sp=0xc000118fe0 pc=0x616d11ec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 12 gp=0xc0004a4380 m=nil [GC worker (idle)]:
runtime.gopark(0x32e9569fd16?, 0x1?, 0x73?, 0xb5?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000119738 sp=0xc000119718 pc=0x616d11ebdc4e
runtime.gcBgMarkWorker(0xc00004f730)
	runtime/mgc.go:1412 +0xe9 fp=0xc0001197c8 sp=0xc000119738 pc=0x616d11e6a2e9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1328 +0x25 fp=0xc0001197e0 sp=0xc0001197c8 pc=0x616d11e6a1c5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0001197e8 sp=0xc0001197e0 pc=0x616d11ec6021
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1328 +0x105

goroutine 50 gp=0xc000504700 m=nil [select]:
runtime.gopark(0xc00005ba68?, 0x2?, 0xce?, 0x36?, 0xc00005b834?)
	runtime/proc.go:424 +0xce fp=0xc00005b650 sp=0xc00005b630 pc=0x616d11ebdc4e
runtime.selectgo(0xc00005ba68, 0xc00005b830, 0x21?, 0x0, 0x1?, 0x1)
	runtime/select.go:335 +0x7a5 fp=0xc00005b778 sp=0xc00005b650 pc=0x616d11e9af45
ollama/llama/runner.(*Server).completion(0xc0001a7560, {0x616d13006910, 0xc0004c2620}, 0xc00060ea00)
	ollama/llama/runner/runner.go:696 +0xab6 fp=0xc00005bac0 sp=0xc00005b778 pc=0x616d12289236
ollama/llama/runner.(*Server).completion-fm({0x616d13006910?, 0xc0004c2620?}, 0x616d12221fe7?)
	<autogenerated>:1 +0x36 fp=0xc00005baf0 sp=0xc00005bac0 pc=0x616d1228c916
net/http.HandlerFunc.ServeHTTP(0xc000536ee0?, {0x616d13006910?, 0xc0004c2620?}, 0x0?)
	net/http/server.go:2220 +0x29 fp=0xc00005bb18 sp=0xc00005baf0 pc=0x616d12214809
net/http.(*ServeMux).ServeHTTP(0x616d11e5e485?, {0x616d13006910, 0xc0004c2620}, 0xc00060ea00)
	net/http/server.go:2747 +0x1ca fp=0xc00005bb68 sp=0xc00005bb18 pc=0x616d1221670a
net/http.serverHandler.ServeHTTP({0x616d13003510?}, {0x616d13006910?, 0xc0004c2620?}, 0x6?)
	net/http/server.go:3210 +0x8e fp=0xc00005bb98 sp=0xc00005bb68 pc=0x616d12233c6e
net/http.(*conn).serve(0xc00001e000, {0x616d13008988, 0xc000608510})
	net/http/server.go:2092 +0x5d0 fp=0xc00005bfb8 sp=0xc00005bb98 pc=0x616d122131b0
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3360 +0x28 fp=0xc00005bfe0 sp=0xc00005bfb8 pc=0x616d12218608
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x616d11ec6021
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3360 +0x485

goroutine 39 gp=0xc0005848c0 m=nil [IO wait]:
runtime.gopark(0x616d11e62965?, 0x0?, 0x0?, 0x0?, 0xb?)
	runtime/proc.go:424 +0xce fp=0xc00011eda8 sp=0xc00011ed88 pc=0x616d11ebdc4e
runtime.netpollblock(0x616d11ee0e78?, 0x11e54a66?, 0x6d?)
	runtime/netpoll.go:575 +0xf7 fp=0xc00011ede0 sp=0xc00011eda8 pc=0x616d11e818b7
internal/poll.runtime_pollWait(0x725dfed35568, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc00011ee00 sp=0xc00011ede0 pc=0x616d11ebcf45
internal/poll.(*pollDesc).wait(0xc00049c000?, 0xc000608581?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00011ee28 sp=0xc00011ee00 pc=0x616d11f44567
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00049c000, {0xc000608581, 0x1, 0x1})
	internal/poll/fd_unix.go:165 +0x27a fp=0xc00011eec0 sp=0xc00011ee28 pc=0x616d11f4585a
net.(*netFD).Read(0xc00049c000, {0xc000608581?, 0xc00011ef48?, 0x616d11ebf8d0?})
	net/fd_posix.go:55 +0x25 fp=0xc00011ef08 sp=0xc00011eec0 pc=0x616d11fb0045
net.(*conn).Read(0xc000088040, {0xc000608581?, 0x0?, 0x616d137c6680?})
	net/net.go:189 +0x45 fp=0xc00011ef50 sp=0xc00011ef08 pc=0x616d11fbe645
net.(*TCPConn).Read(0x616d137030a0?, {0xc000608581?, 0x0?, 0x0?})
	<autogenerated>:1 +0x25 fp=0xc00011ef80 sp=0xc00011ef50 pc=0x616d11fd1845
net/http.(*connReader).backgroundRead(0xc000608570)
	net/http/server.go:690 +0x37 fp=0xc00011efc8 sp=0xc00011ef80 pc=0x616d1220db37
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:686 +0x25 fp=0xc00011efe0 sp=0xc00011efc8 pc=0x616d1220da65
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00011efe8 sp=0xc00011efe0 pc=0x616d11ec6021
created by net/http.(*connReader).startBackgroundRead in goroutine 50
	net/http/server.go:686 +0xb6

rax    0x0
rbx    0x0
rcx    0x725d6c006580
rdx    0x725d94ffebf0
rdi    0x6
rsi    0x725d78000b80
rbp    0x725d94ffe960
rsp    0x725d94ffe3f0
r8     0x725d6c006700
r9     0x0
r10    0x725dff2f8f28
r11    0x725d6c006700
r12    0x725d6c006700
r13    0x725d700cd298
r14    0x725d6c006580
r15    0x725d6c006700
rip    0x725dfd20bc2f
rflags 0x10202
cs     0x33
fs     0x0
gs     0x0
[GIN] 2025/02/18 - 00:51:53 | 500 |   41.218616ms |       10.89.0.4 | POST     "/api/chat"

<!-- gh-comment-id:2663901274 --> @Ejo2001 commented on GitHub (Feb 17, 2025): @rick-github It's huuuuuge ``` ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no ggml_sycl_init: SYCL_USE_XMX: yes ggml_sycl_init: found 2 SYCL devices: Couldn't find '/root/.ollama/id_ed25519'. Generating new private key. Your new public key is: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJd9w8VDZHSpRWILi74iB1v7MDHSyZtXdFau2kzQL/1p 2025/02/17 23:53:47 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:localhost,127.0.0.1]" time=2025-02-17T23:53:47.328+08:00 level=INFO source=images.go:757 msg="total blobs: 6" time=2025-02-17T23:53:47.328+08:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached. [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production. 
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST /api/pull --> ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embed --> ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST /v1/embeddings --> ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET / --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-17T23:53:47.328+08:00 level=INFO source=routes.go:1310 msg="Listening on [::]:11434 (version 0.5.4-ipexllm-20250217)"
time=2025-02-17T23:53:47.328+08:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners=[ipex_llm]
[GIN] 2025/02/17 - 23:54:16 | 200 | 7.261733ms | 10.89.0.1 | GET  "/api/tags"
[GIN] 2025/02/17 - 23:54:16 | 200 | 125.667µs | 10.89.0.1 | GET  "/api/version"
[GIN] 2025/02/17 - 23:54:25 | 404 | 1.632384ms | 10.89.0.4 | POST  "/api/generate"
time=2025-02-17T23:54:35.903+08:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-02-17T23:54:35.935+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="29.0 GiB" free_swap="8.0 GiB"
time=2025-02-17T23:54:35.936+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[29.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB"
time=2025-02-17T23:54:35.936+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 41795"
time=2025-02-17T23:54:35.937+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-17T23:54:35.937+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-02-17T23:54:35.938+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
time=2025-02-17T23:54:36.074+08:00 level=INFO source=runner.go:967 msg="starting go runner"
time=2025-02-17T23:54:36.075+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
time=2025-02-17T23:54:36.075+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:41795"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
time=2025-02-17T23:54:36.190+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: SYCL0 model buffer size = 852.89 MiB
llm_load_tensors: SYCL1 model buffer size = 1065.46 MiB
llm_load_tensors: CPU model buffer size = 308.23 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
| 1| [level_zero:gpu:1]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
llama_kv_cache_init: SYCL0 KV buffer size = 960.00 MiB
llama_kv_cache_init: SYCL1 KV buffer size = 832.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.50 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: SYCL0 compute buffer size = 202.01 MiB
llama_new_context_with_model: SYCL1 compute buffer size = 408.52 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 134.02 MiB
llama_new_context_with_model: graph nodes = 790
llama_new_context_with_model: graph splits = 3
time=2025-02-17T23:54:43.679+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ...
" !BADKEY=loadModel time=2025-02-17T23:54:43.720+08:00 level=INFO source=server.go:610 msg="llama runner started in 7.78 seconds" SIGILL: illegal instruction PC=0x74136c80bc2f m=4 sigcode=2 signal arrived during cgo execution instruction bytes: 0xf3 0xf 0xc7 0xf8 0x25 0xff 0x3 0x0 0x0 0x48 0x8b 0xd 0xe1 0xc2 0x2a 0x0 goroutine 39 gp=0xc0001fb180 m=4 mp=0xc00008b508 [syscall]: runtime.cgocall(0x601e0fa584e0, 0xc000159b90) runtime/cgocall.go:167 +0x4b fp=0xc000159b68 sp=0xc000159b30 pc=0x601e0eeb754b ollama/llama/llamafile._Cfunc_llama_decode(0x7412ebc57ee0, {0x1f, 0x7412e801bf90, 0x0, 0x0, 0x7412e801c7a0, 0x7412e801cfb0, 0x7412e801d7c0, 0x7412e83aec30}) _cgo_gotypes.go:558 +0x4f fp=0xc000159b90 sp=0xc000159b68 pc=0x601e0f27996f ollama/llama/llamafile.(*Context).Decode.func1(0x601e0f2886eb?, 0x7412ebc57ee0?) ollama/llama/llamafile/llama.go:143 +0xf5 fp=0xc000159c80 sp=0xc000159b90 pc=0x601e0f27c595 ollama/llama/llamafile.(*Context).Decode(0xc000080d70?, 0x0?) ollama/llama/llamafile/llama.go:143 +0x13 fp=0xc000159cc8 sp=0xc000159c80 pc=0x601e0f27c413 ollama/llama/runner.(*Server).processBatch(0xc0001a75f0, 0xc0003fc9c0, 0xc000080f20) ollama/llama/runner/runner.go:434 +0x23f fp=0xc000159ee0 sp=0xc000159cc8 pc=0x601e0f2873bf ollama/llama/runner.(*Server).run(0xc0001a75f0, {0x601e100089c0, 0xc000529680}) ollama/llama/runner/runner.go:342 +0x1d5 fp=0xc000159fb8 sp=0xc000159ee0 pc=0x601e0f286df5 ollama/llama/runner.Execute.gowrap2() ollama/llama/runner/runner.go:1006 +0x28 fp=0xc000159fe0 sp=0xc000159fb8 pc=0x601e0f28c068 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000159fe8 sp=0xc000159fe0 pc=0x601e0eec6021 created by ollama/llama/runner.Execute in goroutine 1 ollama/llama/runner/runner.go:1006 +0xde5 goroutine 1 gp=0xc0000061c0 m=nil [IO wait]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000127560 sp=0xc000127540 pc=0x601e0eebdc4e runtime.netpollblock(0x4adf80?, 0xee54a66?, 0x1e?) 
runtime/netpoll.go:575 +0xf7 fp=0xc000127598 sp=0xc000127560 pc=0x601e0ee818b7 internal/poll.runtime_pollWait(0x74136e335680, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc0001275b8 sp=0xc000127598 pc=0x601e0eebcf45 internal/poll.(*pollDesc).wait(0xc000506a00?, 0x2c?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0001275e0 sp=0xc0001275b8 pc=0x601e0ef44567 internal/poll.(*pollDesc).waitRead(...) internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Accept(0xc000506a00) internal/poll/fd_unix.go:620 +0x295 fp=0xc000127688 sp=0xc0001275e0 pc=0x601e0ef49935 net.(*netFD).accept(0xc000506a00) net/fd_unix.go:172 +0x29 fp=0xc000127740 sp=0xc000127688 pc=0x601e0efb2009 net.(*TCPListener).accept(0xc000535e00) net/tcpsock_posix.go:159 +0x1e fp=0xc000127790 sp=0xc000127740 pc=0x601e0efc7c7e net.(*TCPListener).Accept(0xc000535e00) net/tcpsock.go:372 +0x30 fp=0xc0001277c0 sp=0xc000127790 pc=0x601e0efc6b30 net/http.(*onceCloseListener).Accept(0xc0004ac000?) <autogenerated>:1 +0x24 fp=0xc0001277d8 sp=0xc0001277c0 pc=0x601e0f240284 net/http.(*Server).Serve(0xc00053af00, {0x601e10006700, 0xc000535e00}) net/http/server.go:3330 +0x30c fp=0xc000127908 sp=0xc0001277d8 pc=0x601e0f21820c ollama/llama/runner.Execute({0xc000036130?, 0x0?, 0x0?}) ollama/llama/runner/runner.go:1027 +0x11a9 fp=0xc000127ca8 sp=0xc000127908 pc=0x601e0f28bd49 ollama/cmd.NewCLI.func2(0xc0001cf200?, {0x601e0fa5cf9d?, 0x4?, 0x601e0fa5cfa1?}) ollama/cmd/cmd.go:1430 +0x45 fp=0xc000127cd0 sp=0xc000127ca8 pc=0x601e0fa57765 github.com/spf13/cobra.(*Command).execute(0xc000130908, {0xc00053a870, 0xf, 0xf}) github.com/spf13/cobra@v1.8.1/command.go:985 +0xaaa fp=0xc000127e58 sp=0xc000127cd0 pc=0x601e0f04b3ea github.com/spf13/cobra.(*Command).ExecuteC(0xc00012e308) github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff fp=0xc000127f30 sp=0xc000127e58 pc=0x601e0f04bcbf github.com/spf13/cobra.(*Command).Execute(...) github.com/spf13/cobra@v1.8.1/command.go:1041 github.com/spf13/cobra.(*Command).ExecuteContext(...) 
github.com/spf13/cobra@v1.8.1/command.go:1034 main.main() ollama/main.go:12 +0x4d fp=0xc000127f50 sp=0xc000127f30 pc=0x601e0fa57dcd runtime.main() runtime/proc.go:272 +0x29d fp=0xc000127fe0 sp=0xc000127f50 pc=0x601e0ee88f5d runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000127fe8 sp=0xc000127fe0 pc=0x601e0eec6021 goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x601e0eebdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.forcegchelper() runtime/proc.go:337 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x601e0ee89298 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x601e0eec6021 created by runtime.init.7 in goroutine 1 runtime/proc.go:325 +0x1a goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x601e0eebdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.bgsweep(0xc0000b2000) runtime/mgcsweep.go:317 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x601e0ee7393f runtime.gcenable.gowrap1() runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x601e0ee67f85 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x601e0eec6021 created by runtime.gcenable in goroutine 1 runtime/mgc.go:204 +0x66 goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]: runtime.gopark(0x10000?, 0x601e0fc03070?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x601e0eebdc4e runtime.goparkunlock(...) 
runtime/proc.go:430 runtime.(*scavengerState).park(0x601e1079ed80) runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x601e0ee71309 runtime.bgscavenge(0xc0000b2000) runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x601e0ee71899 runtime.gcenable.gowrap2() runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x601e0ee67f25 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x601e0eec6021 created by runtime.gcenable in goroutine 1 runtime/mgc.go:205 +0xa5 goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]: runtime.gopark(0xc000084648?, 0x601e0ee5e485?, 0xb0?, 0x1?, 0xc0000061c0?) runtime/proc.go:424 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x601e0eebdc4e runtime.runfinq() runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x601e0ee67007 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x601e0eec6021 created by runtime.createfing in goroutine 1 runtime/mfinal.go:163 +0x3d goroutine 6 gp=0xc0001fae00 m=nil [chan receive]: runtime.gopark(0xc000086760?, 0x601e0ef99685?, 0x40?, 0xe8?, 0x601e10019c00?) runtime/proc.go:424 +0xce fp=0xc000086718 sp=0xc0000866f8 pc=0x601e0eebdc4e runtime.chanrecv(0xc00004e310, 0x0, 0x1) runtime/chan.go:639 +0x41c fp=0xc000086790 sp=0xc000086718 pc=0x601e0ee5767c runtime.chanrecv1(0x0?, 0x0?) runtime/chan.go:489 +0x12 fp=0xc0000867b8 sp=0xc000086790 pc=0x601e0ee57232 runtime.unique_runtime_registerUniqueMapCleanup.func1(...) runtime/mgc.go:1781 runtime.unique_runtime_registerUniqueMapCleanup.gowrap1() runtime/mgc.go:1784 +0x2f fp=0xc0000867e0 sp=0xc0000867b8 pc=0x601e0ee6afef runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x601e0eec6021 created by unique.runtime_registerUniqueMapCleanup in goroutine 1 runtime/mgc.go:1779 +0x96 goroutine 7 gp=0xc0001fb880 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc000086f38 sp=0xc000086f18 pc=0x601e0eebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc000086fc8 sp=0xc000086f38 pc=0x601e0ee6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000086fe0 sp=0xc000086fc8 pc=0x601e0ee6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x601e0eec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 18 gp=0xc000504000 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000080738 sp=0xc000080718 pc=0x601e0eebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc0000807c8 sp=0xc000080738 pc=0x601e0ee6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0000807e0 sp=0xc0000807c8 pc=0x601e0ee6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x601e0eec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 34 gp=0xc000104380 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc00011c738 sp=0xc00011c718 pc=0x601e0eebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc00011c7c8 sp=0xc00011c738 pc=0x601e0ee6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00011c7e0 sp=0xc00011c7c8 pc=0x601e0ee6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011c7e8 sp=0xc00011c7e0 pc=0x601e0eec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 8 gp=0xc0001fba40 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc000087738 sp=0xc000087718 pc=0x601e0eebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc0000877c8 sp=0xc000087738 pc=0x601e0ee6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0000877e0 sp=0xc0000877c8 pc=0x601e0ee6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000877e8 sp=0xc0000877e0 pc=0x601e0eec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 35 gp=0xc000104540 m=nil [GC worker (idle)]: runtime.gopark(0x18c3012f41?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc00011cf38 sp=0xc00011cf18 pc=0x601e0eebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc00011cfc8 sp=0xc00011cf38 pc=0x601e0ee6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00011cfe0 sp=0xc00011cfc8 pc=0x601e0ee6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011cfe8 sp=0xc00011cfe0 pc=0x601e0eec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 36 gp=0xc000104700 m=nil [GC worker (idle)]: runtime.gopark(0x18c301349c?, 0x3?, 0x77?, 0xf7?, 0x0?) runtime/proc.go:424 +0xce fp=0xc00011d738 sp=0xc00011d718 pc=0x601e0eebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc00011d7c8 sp=0xc00011d738 pc=0x601e0ee6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00011d7e0 sp=0xc00011d7c8 pc=0x601e0ee6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011d7e8 sp=0xc00011d7e0 pc=0x601e0eec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 37 gp=0xc0001048c0 m=nil [GC worker (idle)]: runtime.gopark(0x18c300b8f7?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc00011df38 sp=0xc00011df18 pc=0x601e0eebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc00011dfc8 sp=0xc00011df38 pc=0x601e0ee6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00011dfe0 sp=0xc00011dfc8 pc=0x601e0ee6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011dfe8 sp=0xc00011dfe0 pc=0x601e0eec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 9 gp=0xc0001fbc00 m=nil [GC worker (idle)]: runtime.gopark(0x18c30154d9?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000087f38 sp=0xc000087f18 pc=0x601e0eebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc000087fc8 sp=0xc000087f38 pc=0x601e0ee6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000087fe0 sp=0xc000087fc8 pc=0x601e0ee6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000087fe8 sp=0xc000087fe0 pc=0x601e0eec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 10 gp=0xc000104a80 m=nil [select]: runtime.gopark(0xc00005ba68?, 0x2?, 0xce?, 0x36?, 0xc00005b834?) runtime/proc.go:424 +0xce fp=0xc00005b650 sp=0xc00005b630 pc=0x601e0eebdc4e runtime.selectgo(0xc00005ba68, 0xc00005b830, 0x1f?, 0x0, 0x1?, 0x1) runtime/select.go:335 +0x7a5 fp=0xc00005b778 sp=0xc00005b650 pc=0x601e0ee9af45 ollama/llama/runner.(*Server).completion(0xc0001a75f0, {0x601e10006910, 0xc000632ee0}, 0xc00033d180) ollama/llama/runner/runner.go:696 +0xab6 fp=0xc00005bac0 sp=0xc00005b778 pc=0x601e0f289236 ollama/llama/runner.(*Server).completion-fm({0x601e10006910?, 0xc000632ee0?}, 0x601e0f221fe7?) <autogenerated>:1 +0x36 fp=0xc00005baf0 sp=0xc00005bac0 pc=0x601e0f28c916 net/http.HandlerFunc.ServeHTTP(0xc000541340?, {0x601e10006910?, 0xc000632ee0?}, 0x0?) 
net/http/server.go:2220 +0x29 fp=0xc00005bb18 sp=0xc00005baf0 pc=0x601e0f214809 net/http.(*ServeMux).ServeHTTP(0x601e0ee5e485?, {0x601e10006910, 0xc000632ee0}, 0xc00033d180) net/http/server.go:2747 +0x1ca fp=0xc00005bb68 sp=0xc00005bb18 pc=0x601e0f21670a net/http.serverHandler.ServeHTTP({0x601e10003510?}, {0x601e10006910?, 0xc000632ee0?}, 0x6?) net/http/server.go:3210 +0x8e fp=0xc00005bb98 sp=0xc00005bb68 pc=0x601e0f233c6e net/http.(*conn).serve(0xc0004ac000, {0x601e10008988, 0xc00060af30}) net/http/server.go:2092 +0x5d0 fp=0xc00005bfb8 sp=0xc00005bb98 pc=0x601e0f2131b0 net/http.(*Server).Serve.gowrap3() net/http/server.go:3360 +0x28 fp=0xc00005bfe0 sp=0xc00005bfb8 pc=0x601e0f218608 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x601e0eec6021 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3360 +0x485 goroutine 31 gp=0xc000584a80 m=nil [IO wait]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?) runtime/proc.go:424 +0xce fp=0xc00011bda8 sp=0xc00011bd88 pc=0x601e0eebdc4e runtime.netpollblock(0x601e0eee0e78?, 0xee54a66?, 0x1e?) runtime/netpoll.go:575 +0xf7 fp=0xc00011bde0 sp=0xc00011bda8 pc=0x601e0ee818b7 internal/poll.runtime_pollWait(0x74136e335568, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc00011be00 sp=0xc00011bde0 pc=0x601e0eebcf45 internal/poll.(*pollDesc).wait(0xc000624500?, 0xc00060afa1?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00011be28 sp=0xc00011be00 pc=0x601e0ef44567 internal/poll.(*pollDesc).waitRead(...) 
internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc000624500, {0xc00060afa1, 0x1, 0x1}) internal/poll/fd_unix.go:165 +0x27a fp=0xc00011bec0 sp=0xc00011be28 pc=0x601e0ef4585a net.(*netFD).Read(0xc000624500, {0xc00060afa1?, 0x0?, 0x0?}) net/fd_posix.go:55 +0x25 fp=0xc00011bf08 sp=0xc00011bec0 pc=0x601e0efb0045 net.(*conn).Read(0xc000520030, {0xc00060afa1?, 0x0?, 0x0?}) net/net.go:189 +0x45 fp=0xc00011bf50 sp=0xc00011bf08 pc=0x601e0efbe645 net.(*TCPConn).Read(0x0?, {0xc00060afa1?, 0x0?, 0x0?}) <autogenerated>:1 +0x25 fp=0xc00011bf80 sp=0xc00011bf50 pc=0x601e0efd1845 net/http.(*connReader).backgroundRead(0xc00060af90) net/http/server.go:690 +0x37 fp=0xc00011bfc8 sp=0xc00011bf80 pc=0x601e0f20db37 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:686 +0x25 fp=0xc00011bfe0 sp=0xc00011bfc8 pc=0x601e0f20da65 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011bfe8 sp=0xc00011bfe0 pc=0x601e0eec6021 created by net/http.(*connReader).startBackgroundRead in goroutine 10 net/http/server.go:686 +0xb6 rax 0x0 rbx 0x0 rcx 0x7412f80063c0 rdx 0x741307dfebf0 rdi 0x6 rsi 0x601e2f6f7060 rbp 0x741307dfe960 rsp 0x741307dfe3f0 r8 0x7412f8006540 r9 0x0 r10 0x74136e8f8f28 r11 0x7412f8006540 r12 0x7412f8006540 r13 0x7412e80cd298 r14 0x7412f80063c0 r15 0x7412f8006540 rip 0x74136c80bc2f rflags 0x10202 cs 0x33 fs 0x0 gs 0x0 [GIN] 2025/02/17 - 23:54:43 | 500 | 7.913867252s | 10.89.0.4 | POST  "/api/generate" time=2025-02-17T23:54:57.191+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="28.6 GiB" free_swap="8.0 GiB" time=2025-02-17T23:54:57.191+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[28.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" 
memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB" time=2025-02-17T23:54:57.192+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 41959" time=2025-02-17T23:54:57.192+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2025-02-17T23:54:57.192+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding" time=2025-02-17T23:54:57.192+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error" ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no ggml_sycl_init: SYCL_USE_XMX: yes ggml_sycl_init: found 2 SYCL devices: time=2025-02-17T23:54:57.326+08:00 level=INFO source=runner.go:967 msg="starting go runner" time=2025-02-17T23:54:57.326+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8 get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free time=2025-02-17T23:54:57.326+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:41959" 
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Llama-3.2 llama_model_loader: - kv 5: general.size_label str = 3B llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... llama_model_loader: - kv 8: llama.block_count u32 = 28 llama_model_loader: - kv 9: llama.context_length u32 = 131072 llama_model_loader: - kv 10: llama.embedding_length u32 = 3072 llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192 llama_model_loader: - kv 12: llama.attention.head_count u32 = 24 llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 llama_model_loader: - kv 18: general.file_type u32 = 15 llama_model_loader: - kv 19: llama.vocab_size u32 = 128256 llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... 
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
time=2025-02-17T23:54:57.443+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: SYCL0 model buffer size = 852.89 MiB
llm_load_tensors: SYCL1 model buffer size = 1065.46 MiB
llm_load_tensors: CPU model buffer size = 308.23 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|    1.6.32224.500000|
| 1| [level_zero:gpu:1]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|    1.6.32224.500000|
llama_kv_cache_init: SYCL0 KV buffer size = 960.00 MiB
llama_kv_cache_init: SYCL1 KV buffer size = 832.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.50 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: SYCL0 compute buffer size = 202.01 MiB
llama_new_context_with_model: SYCL1 compute buffer size = 408.52 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 134.02 MiB
llama_new_context_with_model: graph nodes = 790
llama_new_context_with_model: graph splits = 3
time=2025-02-17T23:54:59.613+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-02-17T23:54:59.702+08:00 level=INFO source=server.go:610 msg="llama runner started in 2.51 seconds"
SIGILL: illegal instruction
PC=0x76bc5f60bc2f m=3 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xf3 0xf 0xc7 0xf8 0x25 0xff 0x3 0x0 0x0 0x48 0x8b 0xd 0xe1 0xc2 0x2a 0x0

goroutine 8 gp=0xc000484c40 m=3 mp=0xc00008ae08 [syscall]:
runtime.cgocall(0x5d46b18584e0, 0xc000097b90)
	runtime/cgocall.go:167 +0x4b fp=0xc000097b68 sp=0xc000097b30 pc=0x5d46b0cb754b
ollama/llama/llamafile._Cfunc_llama_decode(0x76bbefc576b0, {0x21, 0x76bbec020350, 0x0, 0x0, 0x76bbec01b8f0, 0x76bbec01c100, 0x76bbec01c910, 0x76bbec0813f0})
	_cgo_gotypes.go:558 +0x4f fp=0xc000097b90 sp=0xc000097b68 pc=0x5d46b107996f
ollama/llama/llamafile.(*Context).Decode.func1(0x5d46b10886eb?, 0x76bbefc576b0?)
	ollama/llama/llamafile/llama.go:143 +0xf5 fp=0xc000097c80 sp=0xc000097b90 pc=0x5d46b107c595
ollama/llama/llamafile.(*Context).Decode(0xc00048d570?, 0x0?)
	ollama/llama/llamafile/llama.go:143 +0x13 fp=0xc000097cc8 sp=0xc000097c80 pc=0x5d46b107c413
ollama/llama/runner.(*Server).processBatch(0xc00019d560, 0xc00001e060, 0xc00048d720)
	ollama/llama/runner/runner.go:434 +0x23f fp=0xc000097ee0 sp=0xc000097cc8 pc=0x5d46b10873bf
ollama/llama/runner.(*Server).run(0xc00019d560, {0x5d46b1e089c0, 0xc0005227d0})
	ollama/llama/runner/runner.go:342 +0x1d5 fp=0xc000097fb8 sp=0xc000097ee0 pc=0x5d46b1086df5
ollama/llama/runner.Execute.gowrap2()
	ollama/llama/runner/runner.go:1006 +0x28 fp=0xc000097fe0 sp=0xc000097fb8 pc=0x5d46b108c068
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000097fe8 sp=0xc000097fe0 pc=0x5d46b0cc6021
created by ollama/llama/runner.Execute in goroutine 1
	ollama/llama/runner/runner.go:1006 +0xde5

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:424 +0xce fp=0xc000507560 sp=0xc000507540 pc=0x5d46b0cbdc4e runtime.netpollblock(0x4e5f80?, 0xb0c54a66?, 0x46?) runtime/netpoll.go:575 +0xf7 fp=0xc000507598 sp=0xc000507560 pc=0x5d46b0c818b7 internal/poll.runtime_pollWait(0x76bc60dc8df0, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc0005075b8 sp=0xc000507598 pc=0x5d46b0cbcf45 internal/poll.(*pollDesc).wait(0xc000604700?, 0x2c?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0005075e0 sp=0xc0005075b8 pc=0x5d46b0d44567 internal/poll.(*pollDesc).waitRead(...) internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Accept(0xc000604700) internal/poll/fd_unix.go:620 +0x295 fp=0xc000507688 sp=0xc0005075e0 pc=0x5d46b0d49935 net.(*netFD).accept(0xc000604700) net/fd_unix.go:172 +0x29 fp=0xc000507740 sp=0xc000507688 pc=0x5d46b0db2009 net.(*TCPListener).accept(0xc000366740) net/tcpsock_posix.go:159 +0x1e fp=0xc000507790 sp=0xc000507740 pc=0x5d46b0dc7c7e net.(*TCPListener).Accept(0xc000366740) net/tcpsock.go:372 +0x30 fp=0xc0005077c0 sp=0xc000507790 pc=0x5d46b0dc6b30 net/http.(*onceCloseListener).Accept(0xc0004e4000?) 
<autogenerated>:1 +0x24 fp=0xc0005077d8 sp=0xc0005077c0 pc=0x5d46b1040284 net/http.(*Server).Serve(0xc0004cd0e0, {0x5d46b1e06700, 0xc000366740}) net/http/server.go:3330 +0x30c fp=0xc000507908 sp=0xc0005077d8 pc=0x5d46b101820c ollama/llama/runner.Execute({0xc000136010?, 0x0?, 0x0?}) ollama/llama/runner/runner.go:1027 +0x11a9 fp=0xc000507ca8 sp=0xc000507908 pc=0x5d46b108bd49 ollama/cmd.NewCLI.func2(0xc0001c9400?, {0x5d46b185cf9d?, 0x4?, 0x5d46b185cfa1?}) ollama/cmd/cmd.go:1430 +0x45 fp=0xc000507cd0 sp=0xc000507ca8 pc=0x5d46b1857765 github.com/spf13/cobra.(*Command).execute(0xc0000ca908, {0xc0004cca50, 0xf, 0xf}) github.com/spf13/cobra@v1.8.1/command.go:985 +0xaaa fp=0xc000507e58 sp=0xc000507cd0 pc=0x5d46b0e4b3ea github.com/spf13/cobra.(*Command).ExecuteC(0xc0004c8f08) github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff fp=0xc000507f30 sp=0xc000507e58 pc=0x5d46b0e4bcbf github.com/spf13/cobra.(*Command).Execute(...) github.com/spf13/cobra@v1.8.1/command.go:1041 github.com/spf13/cobra.(*Command).ExecuteContext(...) github.com/spf13/cobra@v1.8.1/command.go:1034 main.main() ollama/main.go:12 +0x4d fp=0xc000507f50 sp=0xc000507f30 pc=0x5d46b1857dcd runtime.main() runtime/proc.go:272 +0x29d fp=0xc000507fe0 sp=0xc000507f50 pc=0x5d46b0c88f5d runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000507fe8 sp=0xc000507fe0 pc=0x5d46b0cc6021 goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x5d46b0cbdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.forcegchelper() runtime/proc.go:337 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x5d46b0c89298 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x5d46b0cc6021 created by runtime.init.7 in goroutine 1 runtime/proc.go:325 +0x1a goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x5d46b0cbdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.bgsweep(0xc0000b2000) runtime/mgcsweep.go:317 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x5d46b0c7393f runtime.gcenable.gowrap1() runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x5d46b0c67f85 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x5d46b0cc6021 created by runtime.gcenable in goroutine 1 runtime/mgc.go:204 +0x66 goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]: runtime.gopark(0x10000?, 0x5d46b1a03070?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x5d46b0cbdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.(*scavengerState).park(0x5d46b259ed80) runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x5d46b0c71309 runtime.bgscavenge(0xc0000b2000) runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x5d46b0c71899 runtime.gcenable.gowrap2() runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x5d46b0c67f25 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x5d46b0cc6021 created by runtime.gcenable in goroutine 1 runtime/mgc.go:205 +0xa5 goroutine 18 gp=0xc000104700 m=nil [finalizer wait]: runtime.gopark(0xc000084648?, 0x5d46b0c5e485?, 0xb0?, 0x1?, 0xc0000061c0?) runtime/proc.go:424 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x5d46b0cbdc4e runtime.runfinq() runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x5d46b0c67007 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x5d46b0cc6021 created by runtime.createfing in goroutine 1 runtime/mfinal.go:163 +0x3d goroutine 19 gp=0xc000105880 m=nil [chan receive]: runtime.gopark(0xc000080760?, 0x5d46b0d99685?, 0x40?, 0xc8?, 0x5d46b1e19c00?) 
runtime/proc.go:424 +0xce fp=0xc000080718 sp=0xc0000806f8 pc=0x5d46b0cbdc4e runtime.chanrecv(0xc0001122a0, 0x0, 0x1) runtime/chan.go:639 +0x41c fp=0xc000080790 sp=0xc000080718 pc=0x5d46b0c5767c runtime.chanrecv1(0x0?, 0x0?) runtime/chan.go:489 +0x12 fp=0xc0000807b8 sp=0xc000080790 pc=0x5d46b0c57232 runtime.unique_runtime_registerUniqueMapCleanup.func1(...) runtime/mgc.go:1781 runtime.unique_runtime_registerUniqueMapCleanup.gowrap1() runtime/mgc.go:1784 +0x2f fp=0xc0000807e0 sp=0xc0000807b8 pc=0x5d46b0c6afef runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x5d46b0cc6021 created by unique.runtime_registerUniqueMapCleanup in goroutine 1 runtime/mgc.go:1779 +0x96 goroutine 20 gp=0xc0004701c0 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000080f38 sp=0xc000080f18 pc=0x5d46b0cbdc4e runtime.gcBgMarkWorker(0xc0001136c0) runtime/mgc.go:1412 +0xe9 fp=0xc000080fc8 sp=0xc000080f38 pc=0x5d46b0c6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000080fe0 sp=0xc000080fc8 pc=0x5d46b0c6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x5d46b0cc6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 34 gp=0xc000484000 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc00048a738 sp=0xc00048a718 pc=0x5d46b0cbdc4e runtime.gcBgMarkWorker(0xc0001136c0) runtime/mgc.go:1412 +0xe9 fp=0xc00048a7c8 sp=0xc00048a738 pc=0x5d46b0c6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00048a7e0 sp=0xc00048a7c8 pc=0x5d46b0c6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00048a7e8 sp=0xc00048a7e0 pc=0x5d46b0cc6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 5 gp=0xc000007880 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc000086738 sp=0xc000086718 pc=0x5d46b0cbdc4e runtime.gcBgMarkWorker(0xc0001136c0) runtime/mgc.go:1412 +0xe9 fp=0xc0000867c8 sp=0xc000086738 pc=0x5d46b0c6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0000867e0 sp=0xc0000867c8 pc=0x5d46b0c6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x5d46b0cc6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 21 gp=0xc000470380 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000081738 sp=0xc000081718 pc=0x5d46b0cbdc4e runtime.gcBgMarkWorker(0xc0001136c0) runtime/mgc.go:1412 +0xe9 fp=0xc0000817c8 sp=0xc000081738 pc=0x5d46b0c6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0000817e0 sp=0xc0000817c8 pc=0x5d46b0c6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000817e8 sp=0xc0000817e0 pc=0x5d46b0cc6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 22 gp=0xc000470540 m=nil [GC worker (idle)]: runtime.gopark(0x1db5e35f8f?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000081f38 sp=0xc000081f18 pc=0x5d46b0cbdc4e runtime.gcBgMarkWorker(0xc0001136c0) runtime/mgc.go:1412 +0xe9 fp=0xc000081fc8 sp=0xc000081f38 pc=0x5d46b0c6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000081fe0 sp=0xc000081fc8 pc=0x5d46b0c6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000081fe8 sp=0xc000081fe0 pc=0x5d46b0cc6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 6 gp=0xc000007a40 m=nil [GC worker (idle)]: runtime.gopark(0x1db5e5291d?, 0x3?, 0x2a?, 0x5f?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc000086f38 sp=0xc000086f18 pc=0x5d46b0cbdc4e runtime.gcBgMarkWorker(0xc0001136c0) runtime/mgc.go:1412 +0xe9 fp=0xc000086fc8 sp=0xc000086f38 pc=0x5d46b0c6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000086fe0 sp=0xc000086fc8 pc=0x5d46b0c6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x5d46b0cc6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 23 gp=0xc000470700 m=nil [GC worker (idle)]: runtime.gopark(0x1db5e5210d?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000082738 sp=0xc000082718 pc=0x5d46b0cbdc4e runtime.gcBgMarkWorker(0xc0001136c0) runtime/mgc.go:1412 +0xe9 fp=0xc0000827c8 sp=0xc000082738 pc=0x5d46b0c6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0000827e0 sp=0xc0000827c8 pc=0x5d46b0c6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000827e8 sp=0xc0000827e0 pc=0x5d46b0cc6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 35 gp=0xc0004841c0 m=nil [GC worker (idle)]: runtime.gopark(0x1db5e3472e?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc00048af38 sp=0xc00048af18 pc=0x5d46b0cbdc4e runtime.gcBgMarkWorker(0xc0001136c0) runtime/mgc.go:1412 +0xe9 fp=0xc00048afc8 sp=0xc00048af38 pc=0x5d46b0c6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00048afe0 sp=0xc00048afc8 pc=0x5d46b0c6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00048afe8 sp=0xc00048afe0 pc=0x5d46b0cc6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 42 gp=0xc000484a80 m=nil [IO wait]: runtime.gopark(0x5d46b0c62965?, 0x0?, 0x0?, 0x0?, 0xb?) runtime/proc.go:424 +0xce fp=0xc0004885a8 sp=0xc000488588 pc=0x5d46b0cbdc4e runtime.netpollblock(0x5d46b0ce0e78?, 0xb0c54a66?, 0x46?) 
runtime/netpoll.go:575 +0xf7 fp=0xc0004885e0 sp=0xc0004885a8 pc=0x5d46b0c818b7 internal/poll.runtime_pollWait(0x76bc60dc8cd8, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc000488600 sp=0xc0004885e0 pc=0x5d46b0cbcf45 internal/poll.(*pollDesc).wait(0xc0001ac000?, 0xc00011c6d1?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000488628 sp=0xc000488600 pc=0x5d46b0d44567 internal/poll.(*pollDesc).waitRead(...) internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc0001ac000, {0xc00011c6d1, 0x1, 0x1}) internal/poll/fd_unix.go:165 +0x27a fp=0xc0004886c0 sp=0xc000488628 pc=0x5d46b0d4585a net.(*netFD).Read(0xc0001ac000, {0xc00011c6d1?, 0xc000488748?, 0x5d46b0cbf8d0?}) net/fd_posix.go:55 +0x25 fp=0xc000488708 sp=0xc0004886c0 pc=0x5d46b0db0045 net.(*conn).Read(0xc00012a040, {0xc00011c6d1?, 0x0?, 0x5d46b25c6680?}) net/net.go:189 +0x45 fp=0xc000488750 sp=0xc000488708 pc=0x5d46b0dbe645 net.(*TCPConn).Read(0x5d46b25030a0?, {0xc00011c6d1?, 0x0?, 0x0?}) <autogenerated>:1 +0x25 fp=0xc000488780 sp=0xc000488750 pc=0x5d46b0dd1845 net/http.(*connReader).backgroundRead(0xc00011c6c0) net/http/server.go:690 +0x37 fp=0xc0004887c8 sp=0xc000488780 pc=0x5d46b100db37 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:686 +0x25 fp=0xc0004887e0 sp=0xc0004887c8 pc=0x5d46b100da65 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0004887e8 sp=0xc0004887e0 pc=0x5d46b0cc6021 created by net/http.(*connReader).startBackgroundRead in goroutine 36 net/http/server.go:686 +0xb6 goroutine 36 gp=0xc0004708c0 m=nil [select]: runtime.gopark(0xc00005ba68?, 0x2?, 0xce?, 0x36?, 0xc00005b834?) 
runtime/proc.go:424 +0xce fp=0xc00005b650 sp=0xc00005b630 pc=0x5d46b0cbdc4e runtime.selectgo(0xc00005ba68, 0xc00005b830, 0x21?, 0x0, 0x1?, 0x1) runtime/select.go:335 +0x7a5 fp=0xc00005b778 sp=0xc00005b650 pc=0x5d46b0c9af45 ollama/llama/runner.(*Server).completion(0xc00019d560, {0x5d46b1e06910, 0xc0001a29a0}, 0xc000422640) ollama/llama/runner/runner.go:696 +0xab6 fp=0xc00005bac0 sp=0xc00005b778 pc=0x5d46b1089236 ollama/llama/runner.(*Server).completion-fm({0x5d46b1e06910?, 0xc0001a29a0?}, 0x5d46b1021fe7?) <autogenerated>:1 +0x36 fp=0xc00005baf0 sp=0xc00005bac0 pc=0x5d46b108c916 net/http.HandlerFunc.ServeHTTP(0xc0004bbc00?, {0x5d46b1e06910?, 0xc0001a29a0?}, 0x0?) net/http/server.go:2220 +0x29 fp=0xc00005bb18 sp=0xc00005baf0 pc=0x5d46b1014809 net/http.(*ServeMux).ServeHTTP(0x5d46b0c5e485?, {0x5d46b1e06910, 0xc0001a29a0}, 0xc000422640) net/http/server.go:2747 +0x1ca fp=0xc00005bb68 sp=0xc00005bb18 pc=0x5d46b101670a net/http.serverHandler.ServeHTTP({0x5d46b1e03510?}, {0x5d46b1e06910?, 0xc0001a29a0?}, 0x6?) 
	net/http/server.go:3210 +0x8e fp=0xc00005bb98 sp=0xc00005bb68 pc=0x5d46b1033c6e
net/http.(*conn).serve(0xc0004e4000, {0x5d46b1e08988, 0xc00060ed20})
	net/http/server.go:2092 +0x5d0 fp=0xc00005bfb8 sp=0xc00005bb98 pc=0x5d46b10131b0
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3360 +0x28 fp=0xc00005bfe0 sp=0xc00005bfb8 pc=0x5d46b1018608
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x5d46b0cc6021
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3360 +0x485

rax    0x0
rbx    0x0
rcx    0x76bbf0006940
rdx    0x76bbffdfebf0
rdi    0x6
rsi    0x5d46df7625a0
rbp    0x76bbffdfe960
rsp    0x76bbffdfe3f0
r8     0x76bbf0006ac0
r9     0x0
r10    0x76bc613fcf28
r11    0x76bbf0006ac0
r12    0x76bbf0006ac0
r13    0x76bbec0cd298
r14    0x76bbf0006940
r15    0x76bbf0006ac0
rip    0x76bc5f60bc2f
rflags 0x10202
cs     0x33
fs     0x0
gs     0x0
[GIN] 2025/02/17 - 23:54:59 | 500 | 2.583875751s | 10.89.0.4 | POST  "/api/generate"
time=2025-02-18T00:05:40.353+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="29.0 GiB" free_swap="8.0 GiB"
time=2025-02-18T00:05:40.353+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[29.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB"
time=2025-02-18T00:05:40.353+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 43439"
time=2025-02-18T00:05:40.354+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-18T00:05:40.354+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-02-18T00:05:40.354+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
time=2025-02-18T00:05:40.488+08:00 level=INFO source=runner.go:967 msg="starting go runner"
time=2025-02-18T00:05:40.488+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
time=2025-02-18T00:05:40.488+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:43439"
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Llama-3.2 llama_model_loader: - kv 5: general.size_label str = 3B llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... llama_model_loader: - kv 8: llama.block_count u32 = 28 llama_model_loader: - kv 9: llama.context_length u32 = 131072 llama_model_loader: - kv 10: llama.embedding_length u32 = 3072 llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192 llama_model_loader: - kv 12: llama.attention.head_count u32 = 24 llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 llama_model_loader: - kv 18: general.file_type u32 = 15 llama_model_loader: - kv 19: llama.vocab_size u32 = 128256 llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - type f32: 58 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors time=2025-02-18T00:05:40.605+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 3072 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_head = 24 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 3 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 8192 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: 
rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 3B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 3.21 B llm_load_print_meta: model size = 1.87 GiB (5.01 BPW) llm_load_print_meta: general.name = Llama 3.2 3B Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOM token = 128008 '<|eom_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOG token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory llm_load_tensors: offloading 28 repeating layers to GPU llm_load_tensors: offloading output layer to GPU llm_load_tensors: offloaded 29/29 layers to GPU llm_load_tensors: SYCL0 model buffer size = 852.89 MiB llm_load_tensors: SYCL1 model buffer size = 1065.46 MiB llm_load_tensors: CPU model buffer size = 308.23 MiB llama_new_context_with_model: n_seq_max = 1 llama_new_context_with_model: n_ctx = 16384 llama_new_context_with_model: n_ctx_per_seq = 16384 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|    1.6.32224.500000|
| 1| [level_zero:gpu:1]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|    1.6.32224.500000|
llama_kv_cache_init: SYCL0 KV buffer size = 960.00 MiB
llama_kv_cache_init: SYCL1 KV buffer size = 832.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.50 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: SYCL0 compute buffer size = 202.01 MiB
llama_new_context_with_model: SYCL1 compute buffer size = 408.52 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 134.02 MiB
llama_new_context_with_model: graph nodes = 790
llama_new_context_with_model: graph splits = 3
time=2025-02-18T00:05:42.945+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-02-18T00:05:43.114+08:00 level=INFO source=server.go:610 msg="llama runner started in 2.76 seconds"
SIGILL: illegal instruction
PC=0x7f465740bc2f m=3 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xf3 0xf 0xc7 0xf8 0x25 0xff 0x3 0x0 0x0 0x48 0x8b 0xd 0xe1 0xc2 0x2a 0x0

goroutine 7 gp=0xc000584e00 m=3 mp=0xc00008ae08 [syscall]:
runtime.cgocall(0x6144ee0584e0, 0xc000096b90)
	runtime/cgocall.go:167 +0x4b fp=0xc000096b68 sp=0xc000096b30 pc=0x6144ed4b754b
ollama/llama/llamafile._Cfunc_llama_decode(0x7f45d3c4a750, {0x21, 0x7f45d001ba90, 0x0, 0x0, 0x7f45d001c2a0, 0x7f45d001cab0, 0x7f45d001d2c0, 0x7f45d0081840})
	_cgo_gotypes.go:558 +0x4f fp=0xc000096b90 sp=0xc000096b68 pc=0x6144ed87996f
ollama/llama/llamafile.(*Context).Decode.func1(0x6144ed8886eb?, 0x7f45d3c4a750?)
	ollama/llama/llamafile/llama.go:143 +0xf5 fp=0xc000096c80 sp=0xc000096b90 pc=0x6144ed87c595
ollama/llama/llamafile.(*Context).Decode(0xc000086d70?, 0x0?)
	ollama/llama/llamafile/llama.go:143 +0x13 fp=0xc000096cc8 sp=0xc000096c80 pc=0x6144ed87c413
ollama/llama/runner.(*Server).processBatch(0xc00019d5f0, 0xc0000b4a20, 0xc000086f20)
	ollama/llama/runner/runner.go:434 +0x23f fp=0xc000096ee0 sp=0xc000096cc8 pc=0x6144ed8873bf
ollama/llama/runner.(*Server).run(0xc00019d5f0, {0x6144ee6089c0, 0xc00016f310})
	ollama/llama/runner/runner.go:342 +0x1d5 fp=0xc000096fb8 sp=0xc000096ee0 pc=0x6144ed886df5
ollama/llama/runner.Execute.gowrap2()
	ollama/llama/runner/runner.go:1006 +0x28 fp=0xc000096fe0 sp=0xc000096fb8 pc=0x6144ed88c068
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000096fe8 sp=0xc000096fe0 pc=0x6144ed4c6021
created by ollama/llama/runner.Execute in goroutine 1
	ollama/llama/runner/runner.go:1006 +0xde5

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000507560 sp=0xc000507540 pc=0x6144ed4bdc4e
runtime.netpollblock(0x20059ff80?, 0xed454a66?, 0x44?)
runtime/netpoll.go:575 +0xf7 fp=0xc000507598 sp=0xc000507560 pc=0x6144ed4818b7 internal/poll.runtime_pollWait(0x7f4658f67df0, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc0005075b8 sp=0xc000507598 pc=0x6144ed4bcf45 internal/poll.(*pollDesc).wait(0xc00048e880?, 0x2c?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0005075e0 sp=0xc0005075b8 pc=0x6144ed544567 internal/poll.(*pollDesc).waitRead(...) internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Accept(0xc00048e880) internal/poll/fd_unix.go:620 +0x295 fp=0xc000507688 sp=0xc0005075e0 pc=0x6144ed549935 net.(*netFD).accept(0xc00048e880) net/fd_unix.go:172 +0x29 fp=0xc000507740 sp=0xc000507688 pc=0x6144ed5b2009 net.(*TCPListener).accept(0xc000062f80) net/tcpsock_posix.go:159 +0x1e fp=0xc000507790 sp=0xc000507740 pc=0x6144ed5c7c7e net.(*TCPListener).Accept(0xc000062f80) net/tcpsock.go:372 +0x30 fp=0xc0005077c0 sp=0xc000507790 pc=0x6144ed5c6b30 net/http.(*onceCloseListener).Accept(0xc00059e000?) <autogenerated>:1 +0x24 fp=0xc0005077d8 sp=0xc0005077c0 pc=0x6144ed840284 net/http.(*Server).Serve(0xc0000eb2c0, {0x6144ee606700, 0xc000062f80}) net/http/server.go:3330 +0x30c fp=0xc000507908 sp=0xc0005077d8 pc=0x6144ed81820c ollama/llama/runner.Execute({0xc000136010?, 0x0?, 0x0?}) ollama/llama/runner/runner.go:1027 +0x11a9 fp=0xc000507ca8 sp=0xc000507908 pc=0x6144ed88bd49 ollama/cmd.NewCLI.func2(0xc0001c9400?, {0x6144ee05cf9d?, 0x4?, 0x6144ee05cfa1?}) ollama/cmd/cmd.go:1430 +0x45 fp=0xc000507cd0 sp=0xc000507ca8 pc=0x6144ee057765 github.com/spf13/cobra.(*Command).execute(0xc00069a908, {0xc0000eac30, 0xf, 0xf}) github.com/spf13/cobra@v1.8.1/command.go:985 +0xaaa fp=0xc000507e58 sp=0xc000507cd0 pc=0x6144ed64b3ea github.com/spf13/cobra.(*Command).ExecuteC(0xc0001f9b08) github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff fp=0xc000507f30 sp=0xc000507e58 pc=0x6144ed64bcbf github.com/spf13/cobra.(*Command).Execute(...) github.com/spf13/cobra@v1.8.1/command.go:1041 github.com/spf13/cobra.(*Command).ExecuteContext(...) 
github.com/spf13/cobra@v1.8.1/command.go:1034 main.main() ollama/main.go:12 +0x4d fp=0xc000507f50 sp=0xc000507f30 pc=0x6144ee057dcd runtime.main() runtime/proc.go:272 +0x29d fp=0xc000507fe0 sp=0xc000507f50 pc=0x6144ed488f5d runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000507fe8 sp=0xc000507fe0 pc=0x6144ed4c6021 goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x6144ed4bdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.forcegchelper() runtime/proc.go:337 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x6144ed489298 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x6144ed4c6021 created by runtime.init.7 in goroutine 1 runtime/proc.go:325 +0x1a goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x6144ed4bdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.bgsweep(0xc0000b2000) runtime/mgcsweep.go:317 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x6144ed47393f runtime.gcenable.gowrap1() runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x6144ed467f85 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x6144ed4c6021 created by runtime.gcenable in goroutine 1 runtime/mgc.go:204 +0x66 goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]: runtime.gopark(0x10000?, 0x6144ee203070?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x6144ed4bdc4e runtime.goparkunlock(...) 
runtime/proc.go:430 runtime.(*scavengerState).park(0x6144eed9ed80) runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x6144ed471309 runtime.bgscavenge(0xc0000b2000) runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x6144ed471899 runtime.gcenable.gowrap2() runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x6144ed467f25 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x6144ed4c6021 created by runtime.gcenable in goroutine 1 runtime/mgc.go:205 +0xa5 goroutine 18 gp=0xc000104700 m=nil [finalizer wait]: runtime.gopark(0xc000084648?, 0x6144ed45e485?, 0xb0?, 0x1?, 0xc0000061c0?) runtime/proc.go:424 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x6144ed4bdc4e runtime.runfinq() runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x6144ed467007 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x6144ed4c6021 created by runtime.createfing in goroutine 1 runtime/mfinal.go:163 +0x3d goroutine 19 gp=0xc000105880 m=nil [chan receive]: runtime.gopark(0xc000080760?, 0x6144ed599685?, 0x40?, 0xc8?, 0x6144ee619c00?) runtime/proc.go:424 +0xce fp=0xc000080718 sp=0xc0000806f8 pc=0x6144ed4bdc4e runtime.chanrecv(0xc0001122a0, 0x0, 0x1) runtime/chan.go:639 +0x41c fp=0xc000080790 sp=0xc000080718 pc=0x6144ed45767c runtime.chanrecv1(0x0?, 0x0?) runtime/chan.go:489 +0x12 fp=0xc0000807b8 sp=0xc000080790 pc=0x6144ed457232 runtime.unique_runtime_registerUniqueMapCleanup.func1(...) runtime/mgc.go:1781 runtime.unique_runtime_registerUniqueMapCleanup.gowrap1() runtime/mgc.go:1784 +0x2f fp=0xc0000807e0 sp=0xc0000807b8 pc=0x6144ed46afef runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x6144ed4c6021 created by unique.runtime_registerUniqueMapCleanup in goroutine 1 runtime/mgc.go:1779 +0x96 goroutine 20 gp=0xc000105c00 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc000080f38 sp=0xc000080f18 pc=0x6144ed4bdc4e runtime.gcBgMarkWorker(0xc000113880) runtime/mgc.go:1412 +0xe9 fp=0xc000080fc8 sp=0xc000080f38 pc=0x6144ed46a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000080fe0 sp=0xc000080fc8 pc=0x6144ed46a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x6144ed4c6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 34 gp=0xc000484000 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc00048a738 sp=0xc00048a718 pc=0x6144ed4bdc4e runtime.gcBgMarkWorker(0xc000113880) runtime/mgc.go:1412 +0xe9 fp=0xc00048a7c8 sp=0xc00048a738 pc=0x6144ed46a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00048a7e0 sp=0xc00048a7c8 pc=0x6144ed46a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00048a7e8 sp=0xc00048a7e0 pc=0x6144ed4c6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 5 gp=0xc000007880 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000086738 sp=0xc000086718 pc=0x6144ed4bdc4e runtime.gcBgMarkWorker(0xc000113880) runtime/mgc.go:1412 +0xe9 fp=0xc0000867c8 sp=0xc000086738 pc=0x6144ed46a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0000867e0 sp=0xc0000867c8 pc=0x6144ed46a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x6144ed4c6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 35 gp=0xc0004841c0 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc00048af38 sp=0xc00048af18 pc=0x6144ed4bdc4e runtime.gcBgMarkWorker(0xc000113880) runtime/mgc.go:1412 +0xe9 fp=0xc00048afc8 sp=0xc00048af38 pc=0x6144ed46a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00048afe0 sp=0xc00048afc8 pc=0x6144ed46a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00048afe8 sp=0xc00048afe0 pc=0x6144ed4c6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 36 gp=0xc000484380 m=nil [GC worker (idle)]: runtime.gopark(0xb375489572?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc00048b738 sp=0xc00048b718 pc=0x6144ed4bdc4e runtime.gcBgMarkWorker(0xc000113880) runtime/mgc.go:1412 +0xe9 fp=0xc00048b7c8 sp=0xc00048b738 pc=0x6144ed46a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00048b7e0 sp=0xc00048b7c8 pc=0x6144ed46a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00048b7e8 sp=0xc00048b7e0 pc=0x6144ed4c6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 37 gp=0xc000484540 m=nil [GC worker (idle)]: runtime.gopark(0xb375487a7e?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc00048bf38 sp=0xc00048bf18 pc=0x6144ed4bdc4e runtime.gcBgMarkWorker(0xc000113880) runtime/mgc.go:1412 +0xe9 fp=0xc00048bfc8 sp=0xc00048bf38 pc=0x6144ed46a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00048bfe0 sp=0xc00048bfc8 pc=0x6144ed46a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00048bfe8 sp=0xc00048bfe0 pc=0x6144ed4c6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 21 gp=0xc000105dc0 m=nil [GC worker (idle)]: runtime.gopark(0xb375492c10?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc000081738 sp=0xc000081718 pc=0x6144ed4bdc4e runtime.gcBgMarkWorker(0xc000113880) runtime/mgc.go:1412 +0xe9 fp=0xc0000817c8 sp=0xc000081738 pc=0x6144ed46a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0000817e0 sp=0xc0000817c8 pc=0x6144ed46a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000817e8 sp=0xc0000817e0 pc=0x6144ed4c6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 22 gp=0xc00047a000 m=nil [GC worker (idle)]: runtime.gopark(0xb375488bb1?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000081f38 sp=0xc000081f18 pc=0x6144ed4bdc4e runtime.gcBgMarkWorker(0xc000113880) runtime/mgc.go:1412 +0xe9 fp=0xc000081fc8 sp=0xc000081f38 pc=0x6144ed46a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000081fe0 sp=0xc000081fc8 pc=0x6144ed46a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000081fe8 sp=0xc000081fe0 pc=0x6144ed4c6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 11 gp=0xc000584c40 m=nil [IO wait]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?) runtime/proc.go:424 +0xce fp=0xc0000825a8 sp=0xc000082588 pc=0x6144ed4bdc4e runtime.netpollblock(0x6144ed4e0e78?, 0xed454a66?, 0x44?) runtime/netpoll.go:575 +0xf7 fp=0xc0000825e0 sp=0xc0000825a8 pc=0x6144ed4818b7 internal/poll.runtime_pollWait(0x7f4658f67cd8, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc000082600 sp=0xc0000825e0 pc=0x6144ed4bcf45 internal/poll.(*pollDesc).wait(0xc0001ac000?, 0xc000592101?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000082628 sp=0xc000082600 pc=0x6144ed544567 internal/poll.(*pollDesc).waitRead(...) 
internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc0001ac000, {0xc000592101, 0x1, 0x1}) internal/poll/fd_unix.go:165 +0x27a fp=0xc0000826c0 sp=0xc000082628 pc=0x6144ed54585a net.(*netFD).Read(0xc0001ac000, {0xc000592101?, 0x0?, 0x0?}) net/fd_posix.go:55 +0x25 fp=0xc000082708 sp=0xc0000826c0 pc=0x6144ed5b0045 net.(*conn).Read(0xc000088128, {0xc000592101?, 0x0?, 0x0?}) net/net.go:189 +0x45 fp=0xc000082750 sp=0xc000082708 pc=0x6144ed5be645 net.(*TCPConn).Read(0x0?, {0xc000592101?, 0x0?, 0x0?}) <autogenerated>:1 +0x25 fp=0xc000082780 sp=0xc000082750 pc=0x6144ed5d1845 net/http.(*connReader).backgroundRead(0xc0005920f0) net/http/server.go:690 +0x37 fp=0xc0000827c8 sp=0xc000082780 pc=0x6144ed80db37 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:686 +0x25 fp=0xc0000827e0 sp=0xc0000827c8 pc=0x6144ed80da65 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000827e8 sp=0xc0000827e0 pc=0x6144ed4c6021 created by net/http.(*connReader).startBackgroundRead in goroutine 50 net/http/server.go:686 +0xb6 goroutine 50 gp=0xc000007dc0 m=nil [select]: runtime.gopark(0xc00005ba68?, 0x2?, 0xce?, 0x36?, 0xc00005b834?) runtime/proc.go:424 +0xce fp=0xc00005b650 sp=0xc00005b630 pc=0x6144ed4bdc4e runtime.selectgo(0xc00005ba68, 0xc00005b830, 0x21?, 0x0, 0x1?, 0x1) runtime/select.go:335 +0x7a5 fp=0xc00005b778 sp=0xc00005b650 pc=0x6144ed49af45 ollama/llama/runner.(*Server).completion(0xc00019d5f0, {0x6144ee606910, 0xc0001a2c40}, 0xc000423e00) ollama/llama/runner/runner.go:696 +0xab6 fp=0xc00005bac0 sp=0xc00005b778 pc=0x6144ed889236 ollama/llama/runner.(*Server).completion-fm({0x6144ee606910?, 0xc0001a2c40?}, 0x6144ed821fe7?) <autogenerated>:1 +0x36 fp=0xc00005baf0 sp=0xc00005bac0 pc=0x6144ed88c916 net/http.HandlerFunc.ServeHTTP(0xc0001a28c0?, {0x6144ee606910?, 0xc0001a2c40?}, 0x0?) 
net/http/server.go:2220 +0x29 fp=0xc00005bb18 sp=0xc00005baf0 pc=0x6144ed814809 net/http.(*ServeMux).ServeHTTP(0x6144ed45e485?, {0x6144ee606910, 0xc0001a2c40}, 0xc000423e00) net/http/server.go:2747 +0x1ca fp=0xc00005bb68 sp=0xc00005bb18 pc=0x6144ed81670a net/http.serverHandler.ServeHTTP({0x6144ee603510?}, {0x6144ee606910?, 0xc0001a2c40?}, 0x6?) net/http/server.go:3210 +0x8e fp=0xc00005bb98 sp=0xc00005bb68 pc=0x6144ed833c6e net/http.(*conn).serve(0xc00059e000, {0x6144ee608988, 0xc000698600}) net/http/server.go:2092 +0x5d0 fp=0xc00005bfb8 sp=0xc00005bb98 pc=0x6144ed8131b0 net/http.(*Server).Serve.gowrap3() net/http/server.go:3360 +0x28 fp=0xc00005bfe0 sp=0xc00005bfb8 pc=0x6144ed818608 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x6144ed4c6021 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3360 +0x485 rax 0x0 rbx 0x0 rcx 0x7f45e8006740 rdx 0x7f45f7bfebf0 rdi 0x6 rsi 0x6144ff3565a0 rbp 0x7f45f7bfe960 rsp 0x7f45f7bfe3f0 r8 0x7f45e80068c0 r9 0x0 r10 0x7f4659563f28 r11 0x7f45e80068c0 r12 0x7f45e80068c0 r13 0x7f45d00cd478 r14 0x7f45e8006740 r15 0x7f45e80068c0 rip 0x7f465740bc2f rflags 0x10202 cs 0x33 fs 0x0 gs 0x0 [GIN] 2025/02/18 - 00:05:43 | 500 | 2.83359842s | 10.89.0.4 | POST  "/api/generate" [GIN] 2025/02/18 - 00:06:38 | 200 | 26µs | 10.89.0.4 | GET  "/" [GIN] 2025/02/18 - 00:07:05 | 200 | 17.285µs | 10.89.0.4 | GET  "/" [GIN] 2025/02/18 - 00:07:20 | 404 | 2.766µs | 10.89.0.4 | POST  "/api/cat" [GIN] 2025/02/18 - 00:07:33 | 404 | 17.543µs | 10.89.0.4 | POST  "/api/cat" time=2025-02-18T00:07:46.069+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="29.0 GiB" free_swap="8.0 GiB" time=2025-02-18T00:07:46.069+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[29.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" 
memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB" time=2025-02-18T00:07:46.070+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 38703" time=2025-02-18T00:07:46.070+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2025-02-18T00:07:46.070+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding" time=2025-02-18T00:07:46.070+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error" ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no ggml_sycl_init: SYCL_USE_XMX: yes ggml_sycl_init: found 2 SYCL devices: time=2025-02-18T00:07:46.205+08:00 level=INFO source=runner.go:967 msg="starting go runner" time=2025-02-18T00:07:46.205+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8 get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory time=2025-02-18T00:07:46.205+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:38703" llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free 
memory llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Llama-3.2 llama_model_loader: - kv 5: general.size_label str = 3B llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... llama_model_loader: - kv 8: llama.block_count u32 = 28 llama_model_loader: - kv 9: llama.context_length u32 = 131072 llama_model_loader: - kv 10: llama.embedding_length u32 = 3072 llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192 llama_model_loader: - kv 12: llama.attention.head_count u32 = 24 llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 llama_model_loader: - kv 18: general.file_type u32 = 15 llama_model_loader: - kv 19: llama.vocab_size u32 = 128256 llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] 
= ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - type f32: 58 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors time=2025-02-18T00:07:46.321+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 3072 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_head = 24 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 3 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 8192 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 
llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 3B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 3.21 B llm_load_print_meta: model size = 1.87 GiB (5.01 BPW) llm_load_print_meta: general.name = Llama 3.2 3B Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: EOM token = 128008 '<|eom_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOG token = 128008 '<|eom_id|>' llm_load_print_meta: EOG token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory llm_load_tensors: offloading 28 repeating layers to GPU llm_load_tensors: offloading output layer to GPU llm_load_tensors: offloaded 29/29 layers to GPU llm_load_tensors: SYCL0 model buffer size = 852.89 MiB llm_load_tensors: SYCL1 model buffer size = 1065.46 MiB llm_load_tensors: CPU model 
buffer size = 308.23 MiB llama_new_context_with_model: n_seq_max = 1 llama_new_context_with_model: n_ctx = 16384 llama_new_context_with_model: n_ctx_per_seq = 16384 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized [SYCL] call ggml_check_sycl ggml_check_sycl: GGML_SYCL_DEBUG: 0 ggml_check_sycl: GGML_SYCL_F16: no Found 2 SYCL devices: | | | | |Max | |Max |Global | | | | | | |compute|Max work|sub |mem | | |ID| Device Type| Name|Version|units |group |group|size | Driver version| |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------| | 0| [level_zero:gpu:0]| Intel Graphics [0xe20b]| 20.1| 160| 1024| 32| 12168M| 1.6.32224.500000| | 1| [level_zero:gpu:1]| Intel Graphics [0xe20b]| 20.1| 160| 1024| 32| 12168M| 1.6.32224.500000| llama_kv_cache_init: SYCL0 KV buffer size = 960.00 MiB llama_kv_cache_init: SYCL1 KV buffer size = 832.00 MiB llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB llama_new_context_with_model: SYCL_Host output buffer size = 0.50 MiB get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory llama_new_context_with_model: pipeline parallelism enabled (n_copies=4) llama_new_context_with_model: SYCL0 compute buffer size = 202.01 MiB llama_new_context_with_model: SYCL1 compute buffer size = 408.52 MiB llama_new_context_with_model: SYCL_Host compute buffer size = 134.02 MiB llama_new_context_with_model: graph 
nodes = 790 llama_new_context_with_model: graph splits = 3 time=2025-02-18T00:07:48.472+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel time=2025-02-18T00:07:48.579+08:00 level=INFO source=server.go:610 msg="llama runner started in 2.51 seconds" [GIN] 2025/02/18 - 00:07:48 | 200 | 2.551880862s | 10.89.0.4 | POST  "/api/chat" [GIN] 2025/02/18 - 00:07:59 | 200 | 14.32507ms | 10.89.0.4 | POST  "/api/chat" [GIN] 2025/02/18 - 00:08:03 | 200 | 14.761876ms | 10.89.0.4 | POST  "/api/chat" [GIN] 2025/02/18 - 00:09:21 | 200 | 391.063µs | 10.89.0.1 | GET  "/api/tags" [GIN] 2025/02/18 - 00:09:29 | 200 | 86.722µs | 10.89.0.1 | GET  "/api/version" [GIN] 2025/02/18 - 00:09:45 | 200 | 220.019µs | 10.89.0.4 | GET  "/api/tags" [GIN] 2025/02/18 - 00:10:24 | 200 | 15.094868ms | 10.89.0.4 | POST  "/api/chat" time=2025-02-18T00:51:08.195+08:00 level=INFO source=server.go:104 msg="system memory" total="30.3 GiB" free="29.0 GiB" free_swap="8.0 GiB" time=2025-02-18T00:51:08.195+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[29.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.6 GiB" memory.required.partial="0 B" memory.required.kv="1.8 GiB" memory.required.allocations="[4.6 GiB]" memory.weights.total="3.3 GiB" memory.weights.repeating="3.0 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="824.0 MiB" memory.graph.partial="881.1 MiB" time=2025-02-18T00:51:08.196+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama runner --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 8 --no-mmap --parallel 1 --port 39041" time=2025-02-18T00:51:08.196+08:00 level=INFO source=sched.go:449 msg="loaded runners" 
count=1 time=2025-02-18T00:51:08.196+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding" time=2025-02-18T00:51:08.196+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error" ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no ggml_sycl_init: SYCL_USE_XMX: yes ggml_sycl_init: found 2 SYCL devices: time=2025-02-18T00:51:08.334+08:00 level=INFO source=runner.go:967 msg="starting go runner" time=2025-02-18T00:51:08.334+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8 get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics [0xe20b]) - 11605 MiB free get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory llama_load_model_from_file: using device SYCL1 (Intel(R) Graphics [0xe20b]) - 11605 MiB free time=2025-02-18T00:51:08.334+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:39041" llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
time=2025-02-18T00:51:08.447+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: SYCL0 model buffer size = 852.89 MiB
llm_load_tensors: SYCL1 model buffer size = 1065.46 MiB
llm_load_tensors: CPU model buffer size = 308.23 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
| 1| [level_zero:gpu:1]|                Intel Graphics [0xe20b]|   20.1|    160|    1024|   32| 12168M|     1.6.32224.500000|
llama_kv_cache_init: SYCL0 KV buffer size = 960.00 MiB
llama_kv_cache_init: SYCL1 KV buffer size = 832.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.50 MiB
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: SYCL0 compute buffer size = 202.01 MiB
llama_new_context_with_model: SYCL1 compute buffer size = 408.52 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 134.02 MiB
llama_new_context_with_model: graph nodes = 790
llama_new_context_with_model: graph splits = 3
time=2025-02-18T00:51:10.610+08:00 level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ...
" !BADKEY=loadModel
time=2025-02-18T00:51:10.705+08:00 level=INFO source=server.go:610 msg="llama runner started in 2.51 seconds"
[GIN] 2025/02/18 - 00:51:10 | 200 | 2.550868677s | 10.89.0.4 | POST  "/api/chat"
SIGILL: illegal instruction
PC=0x725dfd20bc2f m=9 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xf3 0xf 0xc7 0xf8 0x25 0xff 0x3 0x0 0x0 0x48 0x8b 0xd 0xe1 0xc2 0x2a 0x0

goroutine 14 gp=0xc0001048c0 m=9 mp=0xc000508008 [syscall]:
runtime.cgocall(0x616d12a584e0, 0xc0004f7b90)
	runtime/cgocall.go:167 +0x4b fp=0xc0004f7b68 sp=0xc0004f7b30 pc=0x616d11eb754b
ollama/llama/llamafile._Cfunc_llama_decode(0x725d73c576b0, {0x21, 0x725d70020350, 0x0, 0x0, 0x725d7001b8f0, 0x725d7001c100, 0x725d7001c910, 0x725d700813f0})
	_cgo_gotypes.go:558 +0x4f fp=0xc0004f7b90 sp=0xc0004f7b68 pc=0x616d1227996f
ollama/llama/llamafile.(*Context).Decode.func1(0x616d122886eb?, 0x725d73c576b0?)
	ollama/llama/llamafile/llama.go:143 +0xf5 fp=0xc0004f7c80 sp=0xc0004f7b90 pc=0x616d1227c595
ollama/llama/llamafile.(*Context).Decode(0xc0004f7d70?, 0x0?)
	ollama/llama/llamafile/llama.go:143 +0x13 fp=0xc0004f7cc8 sp=0xc0004f7c80 pc=0x616d1227c413
ollama/llama/runner.(*Server).processBatch(0xc0001a7560, 0xc00062c0c0, 0xc0004f7f20)
	ollama/llama/runner/runner.go:434 +0x23f fp=0xc0004f7ee0 sp=0xc0004f7cc8 pc=0x616d122873bf
ollama/llama/runner.(*Server).run(0xc0001a7560, {0x616d130089c0, 0xc000526e10})
	ollama/llama/runner/runner.go:342 +0x1d5 fp=0xc0004f7fb8 sp=0xc0004f7ee0 pc=0x616d12286df5
ollama/llama/runner.Execute.gowrap2()
	ollama/llama/runner/runner.go:1006 +0x28 fp=0xc0004f7fe0 sp=0xc0004f7fb8 pc=0x616d1228c068
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0004f7fe8 sp=0xc0004f7fe0 pc=0x616d11ec6021
created by ollama/llama/runner.Execute in goroutine 1
	ollama/llama/runner/runner.go:1006 +0xde5

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
runtime/proc.go:424 +0xce fp=0xc0004b5560 sp=0xc0004b5540 pc=0x616d11ebdc4e runtime.netpollblock(0x1ff80?, 0x11e54a66?, 0x6d?) runtime/netpoll.go:575 +0xf7 fp=0xc0004b5598 sp=0xc0004b5560 pc=0x616d11e818b7 internal/poll.runtime_pollWait(0x725dfed35680, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc0004b55b8 sp=0xc0004b5598 pc=0x616d11ebcf45 internal/poll.(*pollDesc).wait(0xc000507900?, 0x2c?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0004b55e0 sp=0xc0004b55b8 pc=0x616d11f44567 internal/poll.(*pollDesc).waitRead(...) internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Accept(0xc000507900) internal/poll/fd_unix.go:620 +0x295 fp=0xc0004b5688 sp=0xc0004b55e0 pc=0x616d11f49935 net.(*netFD).accept(0xc000507900) net/fd_unix.go:172 +0x29 fp=0xc0004b5740 sp=0xc0004b5688 pc=0x616d11fb2009 net.(*TCPListener).accept(0xc0005236c0) net/tcpsock_posix.go:159 +0x1e fp=0xc0004b5790 sp=0xc0004b5740 pc=0x616d11fc7c7e net.(*TCPListener).Accept(0xc0005236c0) net/tcpsock.go:372 +0x30 fp=0xc0004b57c0 sp=0xc0004b5790 pc=0x616d11fc6b30 net/http.(*onceCloseListener).Accept(0xc00001e000?) 
<autogenerated>:1 +0x24 fp=0xc0004b57d8 sp=0xc0004b57c0 pc=0x616d12240284 net/http.(*Server).Serve(0xc0003ecc30, {0x616d13006700, 0xc0005236c0}) net/http/server.go:3330 +0x30c fp=0xc0004b5908 sp=0xc0004b57d8 pc=0x616d1221820c ollama/llama/runner.Execute({0xc000036130?, 0x0?, 0x0?}) ollama/llama/runner/runner.go:1027 +0x11a9 fp=0xc0004b5ca8 sp=0xc0004b5908 pc=0x616d1228bd49 ollama/cmd.NewCLI.func2(0xc0001cf200?, {0x616d12a5cf9d?, 0x4?, 0x616d12a5cfa1?}) ollama/cmd/cmd.go:1430 +0x45 fp=0xc0004b5cd0 sp=0xc0004b5ca8 pc=0x616d12a57765 github.com/spf13/cobra.(*Command).execute(0xc0004ce908, {0xc0003ec2d0, 0xf, 0xf}) github.com/spf13/cobra@v1.8.1/command.go:985 +0xaaa fp=0xc0004b5e58 sp=0xc0004b5cd0 pc=0x616d1204b3ea github.com/spf13/cobra.(*Command).ExecuteC(0xc0006b2f08) github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff fp=0xc0004b5f30 sp=0xc0004b5e58 pc=0x616d1204bcbf github.com/spf13/cobra.(*Command).Execute(...) github.com/spf13/cobra@v1.8.1/command.go:1041 github.com/spf13/cobra.(*Command).ExecuteContext(...) github.com/spf13/cobra@v1.8.1/command.go:1034 main.main() ollama/main.go:12 +0x4d fp=0xc0004b5f50 sp=0xc0004b5f30 pc=0x616d12a57dcd runtime.main() runtime/proc.go:272 +0x29d fp=0xc0004b5fe0 sp=0xc0004b5f50 pc=0x616d11e88f5d runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0004b5fe8 sp=0xc0004b5fe0 pc=0x616d11ec6021 goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x616d11ebdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.forcegchelper() runtime/proc.go:337 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x616d11e89298 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x616d11ec6021 created by runtime.init.7 in goroutine 1 runtime/proc.go:325 +0x1a goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x616d11ebdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.bgsweep(0xc0000b2000) runtime/mgcsweep.go:317 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x616d11e7393f runtime.gcenable.gowrap1() runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x616d11e67f85 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x616d11ec6021 created by runtime.gcenable in goroutine 1 runtime/mgc.go:204 +0x66 goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]: runtime.gopark(0x10000?, 0x616d12c03070?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x616d11ebdc4e runtime.goparkunlock(...) runtime/proc.go:430 runtime.(*scavengerState).park(0x616d1379ed80) runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x616d11e71309 runtime.bgscavenge(0xc0000b2000) runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x616d11e71899 runtime.gcenable.gowrap2() runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x616d11e67f25 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x616d11ec6021 created by runtime.gcenable in goroutine 1 runtime/mgc.go:205 +0xa5 goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]: runtime.gopark(0xc000084648?, 0x616d11e5e485?, 0xb0?, 0x1?, 0xc0000061c0?) runtime/proc.go:424 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x616d11ebdc4e runtime.runfinq() runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x616d11e67007 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x616d11ec6021 created by runtime.createfing in goroutine 1 runtime/mfinal.go:163 +0x3d goroutine 6 gp=0xc0001fae00 m=nil [chan receive]: runtime.gopark(0xc000086760?, 0x616d11f99685?, 0x40?, 0xe8?, 0x616d13019c00?) 
runtime/proc.go:424 +0xce fp=0xc000086718 sp=0xc0000866f8 pc=0x616d11ebdc4e runtime.chanrecv(0xc00004e310, 0x0, 0x1) runtime/chan.go:639 +0x41c fp=0xc000086790 sp=0xc000086718 pc=0x616d11e5767c runtime.chanrecv1(0x0?, 0x0?) runtime/chan.go:489 +0x12 fp=0xc0000867b8 sp=0xc000086790 pc=0x616d11e57232 runtime.unique_runtime_registerUniqueMapCleanup.func1(...) runtime/mgc.go:1781 runtime.unique_runtime_registerUniqueMapCleanup.gowrap1() runtime/mgc.go:1784 +0x2f fp=0xc0000867e0 sp=0xc0000867b8 pc=0x616d11e6afef runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x616d11ec6021 created by unique.runtime_registerUniqueMapCleanup in goroutine 1 runtime/mgc.go:1779 +0x96 goroutine 7 gp=0xc0001fba40 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000086f38 sp=0xc000086f18 pc=0x616d11ebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc000086fc8 sp=0xc000086f38 pc=0x616d11e6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000086fe0 sp=0xc000086fc8 pc=0x616d11e6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x616d11ec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 18 gp=0xc000504000 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000080738 sp=0xc000080718 pc=0x616d11ebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc0000807c8 sp=0xc000080738 pc=0x616d11e6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0000807e0 sp=0xc0000807c8 pc=0x616d11e6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x616d11ec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 34 gp=0xc000104380 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc00011c738 sp=0xc00011c718 pc=0x616d11ebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc00011c7c8 sp=0xc00011c738 pc=0x616d11e6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc00011c7e0 sp=0xc00011c7c8 pc=0x616d11e6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011c7e8 sp=0xc00011c7e0 pc=0x616d11ec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 8 gp=0xc0001fbc00 m=nil [GC worker (idle)]: runtime.gopark(0x32e956a4c73?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000087738 sp=0xc000087718 pc=0x616d11ebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc0000877c8 sp=0xc000087738 pc=0x616d11e6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0000877e0 sp=0xc0000877c8 pc=0x616d11e6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000877e8 sp=0xc0000877e0 pc=0x616d11ec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 9 gp=0xc0001fbdc0 m=nil [GC worker (idle)]: runtime.gopark(0x32e9569ff0f?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000087f38 sp=0xc000087f18 pc=0x616d11ebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc000087fc8 sp=0xc000087f38 pc=0x616d11e6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000087fe0 sp=0xc000087fc8 pc=0x616d11e6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000087fe8 sp=0xc000087fe0 pc=0x616d11ec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 10 gp=0xc0004a4000 m=nil [GC worker (idle)]: runtime.gopark(0x616d137c8900?, 0x1?, 0xa3?, 0xe?, 0x0?) 
runtime/proc.go:424 +0xce fp=0xc000118738 sp=0xc000118718 pc=0x616d11ebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc0001187c8 sp=0xc000118738 pc=0x616d11e6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0001187e0 sp=0xc0001187c8 pc=0x616d11e6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0001187e8 sp=0xc0001187e0 pc=0x616d11ec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 11 gp=0xc0004a41c0 m=nil [GC worker (idle)]: runtime.gopark(0x32e9569fe95?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000118f38 sp=0xc000118f18 pc=0x616d11ebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc000118fc8 sp=0xc000118f38 pc=0x616d11e6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc000118fe0 sp=0xc000118fc8 pc=0x616d11e6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000118fe8 sp=0xc000118fe0 pc=0x616d11ec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 12 gp=0xc0004a4380 m=nil [GC worker (idle)]: runtime.gopark(0x32e9569fd16?, 0x1?, 0x73?, 0xb5?, 0x0?) runtime/proc.go:424 +0xce fp=0xc000119738 sp=0xc000119718 pc=0x616d11ebdc4e runtime.gcBgMarkWorker(0xc00004f730) runtime/mgc.go:1412 +0xe9 fp=0xc0001197c8 sp=0xc000119738 pc=0x616d11e6a2e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1328 +0x25 fp=0xc0001197e0 sp=0xc0001197c8 pc=0x616d11e6a1c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0001197e8 sp=0xc0001197e0 pc=0x616d11ec6021 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1328 +0x105 goroutine 50 gp=0xc000504700 m=nil [select]: runtime.gopark(0xc00005ba68?, 0x2?, 0xce?, 0x36?, 0xc00005b834?) 
runtime/proc.go:424 +0xce fp=0xc00005b650 sp=0xc00005b630 pc=0x616d11ebdc4e runtime.selectgo(0xc00005ba68, 0xc00005b830, 0x21?, 0x0, 0x1?, 0x1) runtime/select.go:335 +0x7a5 fp=0xc00005b778 sp=0xc00005b650 pc=0x616d11e9af45 ollama/llama/runner.(*Server).completion(0xc0001a7560, {0x616d13006910, 0xc0004c2620}, 0xc00060ea00) ollama/llama/runner/runner.go:696 +0xab6 fp=0xc00005bac0 sp=0xc00005b778 pc=0x616d12289236 ollama/llama/runner.(*Server).completion-fm({0x616d13006910?, 0xc0004c2620?}, 0x616d12221fe7?) <autogenerated>:1 +0x36 fp=0xc00005baf0 sp=0xc00005bac0 pc=0x616d1228c916 net/http.HandlerFunc.ServeHTTP(0xc000536ee0?, {0x616d13006910?, 0xc0004c2620?}, 0x0?) net/http/server.go:2220 +0x29 fp=0xc00005bb18 sp=0xc00005baf0 pc=0x616d12214809 net/http.(*ServeMux).ServeHTTP(0x616d11e5e485?, {0x616d13006910, 0xc0004c2620}, 0xc00060ea00) net/http/server.go:2747 +0x1ca fp=0xc00005bb68 sp=0xc00005bb18 pc=0x616d1221670a net/http.serverHandler.ServeHTTP({0x616d13003510?}, {0x616d13006910?, 0xc0004c2620?}, 0x6?) net/http/server.go:3210 +0x8e fp=0xc00005bb98 sp=0xc00005bb68 pc=0x616d12233c6e net/http.(*conn).serve(0xc00001e000, {0x616d13008988, 0xc000608510}) net/http/server.go:2092 +0x5d0 fp=0xc00005bfb8 sp=0xc00005bb98 pc=0x616d122131b0 net/http.(*Server).Serve.gowrap3() net/http/server.go:3360 +0x28 fp=0xc00005bfe0 sp=0xc00005bfb8 pc=0x616d12218608 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00005bfe8 sp=0xc00005bfe0 pc=0x616d11ec6021 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3360 +0x485 goroutine 39 gp=0xc0005848c0 m=nil [IO wait]: runtime.gopark(0x616d11e62965?, 0x0?, 0x0?, 0x0?, 0xb?) runtime/proc.go:424 +0xce fp=0xc00011eda8 sp=0xc00011ed88 pc=0x616d11ebdc4e runtime.netpollblock(0x616d11ee0e78?, 0x11e54a66?, 0x6d?) 
runtime/netpoll.go:575 +0xf7 fp=0xc00011ede0 sp=0xc00011eda8 pc=0x616d11e818b7 internal/poll.runtime_pollWait(0x725dfed35568, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc00011ee00 sp=0xc00011ede0 pc=0x616d11ebcf45 internal/poll.(*pollDesc).wait(0xc00049c000?, 0xc000608581?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00011ee28 sp=0xc00011ee00 pc=0x616d11f44567 internal/poll.(*pollDesc).waitRead(...) internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc00049c000, {0xc000608581, 0x1, 0x1}) internal/poll/fd_unix.go:165 +0x27a fp=0xc00011eec0 sp=0xc00011ee28 pc=0x616d11f4585a net.(*netFD).Read(0xc00049c000, {0xc000608581?, 0xc00011ef48?, 0x616d11ebf8d0?}) net/fd_posix.go:55 +0x25 fp=0xc00011ef08 sp=0xc00011eec0 pc=0x616d11fb0045 net.(*conn).Read(0xc000088040, {0xc000608581?, 0x0?, 0x616d137c6680?}) net/net.go:189 +0x45 fp=0xc00011ef50 sp=0xc00011ef08 pc=0x616d11fbe645 net.(*TCPConn).Read(0x616d137030a0?, {0xc000608581?, 0x0?, 0x0?}) <autogenerated>:1 +0x25 fp=0xc00011ef80 sp=0xc00011ef50 pc=0x616d11fd1845 net/http.(*connReader).backgroundRead(0xc000608570) net/http/server.go:690 +0x37 fp=0xc00011efc8 sp=0xc00011ef80 pc=0x616d1220db37 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:686 +0x25 fp=0xc00011efe0 sp=0xc00011efc8 pc=0x616d1220da65 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011efe8 sp=0xc00011efe0 pc=0x616d11ec6021 created by net/http.(*connReader).startBackgroundRead in goroutine 50 net/http/server.go:686 +0xb6 rax 0x0 rbx 0x0 rcx 0x725d6c006580 rdx 0x725d94ffebf0 rdi 0x6 rsi 0x725d78000b80 rbp 0x725d94ffe960 rsp 0x725d94ffe3f0 r8 0x725d6c006700 r9 0x0 r10 0x725dff2f8f28 r11 0x725d6c006700 r12 0x725d6c006700 r13 0x725d700cd298 r14 0x725d6c006580 r15 0x725d6c006700 rip 0x725dfd20bc2f rflags 0x10202 cs 0x33 fs 0x0 gs 0x0 [GIN] 2025/02/18 - 00:51:53 | 500 | 41.218616ms | 10.89.0.4 | POST  "/api/chat" ```

@rick-github commented on GitHub (Feb 17, 2025):

So it identified the devices and actually completed a few requests:

[GIN] 2025/02/18 - 00:07:48 | 200 |  2.551880862s |       10.89.0.4 | POST     "/api/chat"
[GIN] 2025/02/18 - 00:07:59 | 200 |    14.32507ms |       10.89.0.4 | POST     "/api/chat"
[GIN] 2025/02/18 - 00:08:03 | 200 |   14.761876ms |       10.89.0.4 | POST     "/api/chat"

Then after a model unload/reload, it got one more successful completion:

[GIN] 2025/02/18 - 00:51:10 | 200 |  2.550868677s |       10.89.0.4 | POST     "/api/chat"

and then crashed in func_llama_decode, presumably in another completion. The crash was due to an illegal instruction, specifically RDRAND:

0000  f3 0f c7 f8           rdrand eax
0004  25 ff 03 00 00        and    eax, 0x3ff
0009  48 8b 0d e1 c2 2a 00  mov    rcx, QWORD PTR [rip+0x2ac2e1]

RDRAND was introduced in Ivy Bridge. I suspect that your CPU doesn't support the instruction. What does the following show:

grep -i rdrand /proc/cpuinfo | head -1

And what's the output of

sudo lscpu

@Ejo2001 commented on GitHub (Feb 17, 2025):

@rick-github

grep -i rdrand /proc/cpuinfo | head -1

Returned:

ejo@arc:~$ grep -i rdrand /proc/cpuinfo | head -1
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

and

sudo lscpu

Returned:

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel(R) Corporation
  Model name:             Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
    BIOS Model name:      Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz To Be Filled By O.E.M. CPU @ 3.6GHz
    BIOS CPU family:      198
    CPU family:           6
    Model:                158
    Thread(s) per core:   1
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             13
    CPU(s) scaling MHz:   27%
    CPU max MHz:          4900.0000
    CPU min MHz:          800.0000
    BogoMIPS:             7200.00
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm cons
                          tant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx
                          16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb
                           stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm i
                          da arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Caches (sum of all):
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     2 MiB (8 instances)
  L3:                     12 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
Vulnerabilities:
  Gather data sampling:   Mitigation; Microcode
  Itlb multihit:          KVM: Mitigation: VMX unsupported
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Mitigation; Clear CPU buffers; SMT disabled
  Reg file data sampling: Not affected
  Retbleed:               Mitigation; Enhanced IBRS
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
  Srbds:                  Mitigation; Microcode
  Tsx async abort:        Mitigation; TSX disabled

My CPU is a 9700K (Coffee Lake), so it should be much more recent than the Ivy Bridge chips


@rick-github commented on GitHub (Feb 17, 2025):

My CPU is a 9700K (Coffee Lake), so it should be much more recent than the Ivy Bridge chips

Yep, lscpu shows that the host CPU supports rdrand, so I don't know why it's being treated as illegal. I don't use podman; does it have virtual CPUs that allow control of extensions? What do you see if you run lscpu or grep rdrand /proc/cpuinfo inside the container?
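(Editorial aside, not part of the original comment: the check above amounts to parsing the "flags" line of /proc/cpuinfo. A minimal illustrative sketch, using an abbreviated flags string modeled on the lscpu output later in this thread:)

```python
def cpu_has_flag(flag: str, cpuinfo: str) -> bool:
    """Return True if a CPU feature flag appears in a /proc/cpuinfo dump."""
    for line in cpuinfo.splitlines():
        if line.startswith("flags"):
            # Everything after the colon is a space-separated flag list
            _, _, value = line.partition(":")
            return flag in value.split()
    return False

# Abbreviated sample of the host's flags line (note: rdrand present, rdpid absent)
sample = "flags\t\t: fpu vme de pse sse2 avx2 rdrand rdseed adx smap"
print(cpu_has_flag("rdrand", sample))  # True
print(cpu_has_flag("rdpid", sample))   # False
```

Running the same check inside the container versus on the host shows whether the container sees a reduced feature set.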


@rick-github commented on GitHub (Feb 17, 2025):

Never mind, wrong opcode, it's actually RDPID, not RDRAND.

0000000000000000 <.data>:
   0:   f3 0f c7 f8             rdpid  %rax
   4:   25 ff 03 00 00          and    $0x3ff,%eax
   9:   48 8b 0d e1 c2 2a 00    mov    0x2ac2e1(%rip),%rcx        # 0x2ac2f1

Introduced in Comet Lake (10th gen), after Coffee Lake (9th gen). It's not in the output of your lscpu so that would seem to be the problem. The ipex image is based on ollama-0.5.4; when they upgrade to 0.5.8+ they should be able to leverage the new CPU backends, which make it easier to support different CPUs.
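(Editorial aside, not part of the original comment: the RDRAND/RDPID mix-up is easy to make because RDRAND, RDSEED and RDPID all share the 0F C7 opcode and differ only in the ModRM reg field and the F3 mandatory prefix. A small illustrative decoder, not from the Ollama codebase:)

```python
def decode_0fc7(insn: bytes) -> str:
    """Distinguish RDRAND/RDSEED/RDPID, which all share the 0F C7 opcode."""
    has_f3 = insn[0] == 0xF3              # F3 is the mandatory prefix for RDPID
    body = insn[1:] if has_f3 else insn
    if body[0] != 0x0F or body[1] != 0xC7:
        return "not 0F C7"
    reg = (body[2] >> 3) & 0x7            # ModRM reg field selects the instruction
    if reg == 6:
        return "rdrand"
    if reg == 7:
        return "rdpid" if has_f3 else "rdseed"
    return "other 0F C7 form"

# The faulting bytes from the trace above: f3 0f c7 f8
print(decode_0fc7(bytes([0xF3, 0x0F, 0xC7, 0xF8])))  # rdpid
```

With F3 present and reg = 7 (0xF8 = mod 11, reg 111, rm 000), the bytes decode as RDPID, matching the corrected disassembly.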


@Ejo2001 commented on GitHub (Feb 17, 2025):

@rick-github Ah, I see... Is it possible for me to patch it myself, or am I out of luck?


@rick-github commented on GitHub (Feb 18, 2025):

I think your best bet is to file a ticket with ipex-llm (https://github.com/intel/ipex-llm/issues). They have a Dockerfile (https://github.com/intel/ipex-llm/blob/main/docker/llm/inference-cpp/Dockerfile) for building the container image, so I thought it might be possible to pull a more recent version of ollama. Unfortunately they pull ollama in via ipex-llm[cpp] and it's not clear how to build that package.


@sgwhat commented on GitHub (Feb 19, 2025):

Are you running ollama in a Dockerfile? And could you show the sycl-ls after activating oneapi?


@desmondsow commented on GitHub (Feb 20, 2025):

The fastest way to verify this is to use the container https://github.com/intel/ipex-llm/blob/main/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md#start-docker-container.

Ensure you have the latest Ubuntu 24.10, run sudo apt upgrade, and reboot to upgrade to the latest kernel version. Provide the sycl-ls output from the container.

Reference: github-starred/ollama#5976