[GH-ISSUE #969] ###### problem #472

Closed
opened 2026-04-12 10:09:04 -05:00 by GiteaMirror · 30 comments

Originally created by @k3341095 on GitHub (Nov 2, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/969

Originally assigned to: @BruceMacD on GitHub.

![image](https://github.com/jmorganca/ollama/assets/17330375/7148c0f6-47b4-4fa4-b219-436e12776f79)

The command to install Docker and run the 13b model worked fine. However, after starting the model and typing "hi", the output is nothing but #### repeating forever.
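
For reference, the Docker setup presumably followed the documented commands; a minimal sketch, assuming the official `ollama/ollama` image and the `llama2:13b` tag:

```
# start the Ollama container with GPU access
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# pull and run the 13b model inside the container
docker exec -it ollama ollama run llama2:13b
```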

GiteaMirror added the bug label 2026-04-12 10:09:04 -05:00

@BruceMacD commented on GitHub (Nov 2, 2023):

Hi @k3341095, which model is this and what are the resources allocated to the container?

<!-- gh-comment-id:1790886699 --> @BruceMacD commented on GitHub (Nov 2, 2023): Hi @k3341095, which model is this and what are the resources allocated to the container?
Author
Owner

@iliabaranov commented on GitHub (Nov 2, 2023):

Hello, same issue here. I am running Mistral natively, just with `ollama run mistral`. It was working fine for a while, and now it's doing the above.

Even if I `kill -9` it and restart, it's still the same issue. It's using 4919MiB / 6144MiB of GPU memory on my card and not much CPU, so resource-wise it looks fine...?


@mchiang0610 commented on GitHub (Nov 2, 2023):

@iliabaranov and @k3341095 Sorry about this. May I ask which OS and system specs you are seeing this problem on?

Just in case the model had issues, would it be possible to ask for the model to be pulled again? (It'll calculate the diff if there are any differences.)

`ollama pull mistral`

Then run it again.

I'm unable to reproduce on a MacBook Pro 16" M1 16GB.

![screenshot 000238@2x](https://github.com/jmorganca/ollama/assets/3325447/90b96075-e949-4d5a-81d2-93a51c8b09da)
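
In other words, the recovery sequence is just (a minimal sketch, using the model name from this thread):

```
ollama pull mistral   # re-downloads only the layers whose digests differ
ollama run mistral
```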


@mchiang0610 commented on GitHub (Nov 2, 2023):

My ollama version is 0.1.7 (checked with `ollama -v`).


@iliabaranov commented on GitHub (Nov 2, 2023):

Same, I'm at 0.1.7.

Terminal output:

```
iliabara@Metis:~$ ollama -v
ollama version 0.1.7
iliabara@Metis:~$ ollama pull mistral
pulling manifest
pulling 6ae280299950... 100% |██████████████████| (4.1/4.1 GB, 25 TB/s)
pulling 22e1b2e8dc2f... 100% |████████████████████| (43/43 B, 641 kB/s)
pulling e35ab70a78c7... 100% |████████████████████| (90/90 B, 1.4 MB/s)
pulling 1cb90d66f4d4... 100% |██████████████████| (381/381 B, 5.9 MB/s)
verifying sha256 digest
writing manifest
removing any unused layers
success
iliabara@Metis:~$ ollama run mistral
>>> hello?
H#######################################################################################################^C
```

@mchiang0610 commented on GitHub (Nov 2, 2023):

@iliabaranov thanks. Could I ask what terminal you are using, and your OS and specs? Trying to narrow this down and troubleshoot.


@visualinventor commented on GitHub (Nov 2, 2023):

I am getting a similar garbled response, but on the second question I ask as a follow-up. I'm using Llama2 and Ollama 0.1.7, on macOS Sonoma on an M1 MacBook.


@iliabaranov commented on GitHub (Nov 2, 2023):

Sure thing, it's the standard terminal on Ubuntu 20.04; nothing particularly special about the setup or OS.


@igorschlum commented on GitHub (Nov 3, 2023):

I use Ollama 0.1.7 on a MacBook M2 with 32GB RAM, macOS 13.5.2 (22G91), and cannot reproduce the issue:

```
(base) igor@macIgor ~ % ollama pull mistral
pulling manifest
pulling 6ae280299950... 100% |████████████████████████████████| (4.1/4.1 GB, 14 TB/s)
pulling 22e1b2e8dc2f... 100% |████████████████████████████████████| (43/43 B, 17 B/s)
pulling e35ab70a78c7... 100% |██████████████████████████████████| (90/90 B, 1.2 MB/s)
pulling 1cb90d66f4d4... 100% |█████████████████████████████████| (381/381 B, 163 B/s)
verifying sha256 digest
writing manifest
removing any unused layers
success
(base) igor@macIgor ~ % ollama run mistral
>>> hello?
Hello! How can I help you today?

>>> can you tell me if the moon is getting bigger today?
I am not aware of any significant events that would cause the Moon to physically change
in size today. However, its appearance may appear to change due to various factors such
as its position in relation to the Earth and Sun or atmospheric conditions. If you have
any other questions, feel free to ask!

>>> hello?
Hello again! How can I assist you further today?

>>> Send a message (/? for help)
```


@k3341095 commented on GitHub (Nov 3, 2023):

Ollama latest / Ubuntu 22.04 / NVIDIA 1080 / 13b model / B450 board, Ryzen 5600G, 32GB memory


@k3341095 commented on GitHub (Nov 3, 2023):

I've tested it again now, and it seems to be due to low memory: the 13B model requires at least 16 gigs of memory, and my GPU memory is 10GB. I thought the 16 gigs meant system memory... and I thought the 13b model size was less than 8 gigs. My apologies. I think you can close this; the mistral model works fine.
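
For a rough sanity check on sizes (a back-of-the-envelope sketch, using the bits-per-weight figure reported for a 7B Q4_0 model further down in this thread):

```
# weight bytes ≈ params × bits_per_weight / 8
# 7B:  7.24e9 × 4.54 / 8 ≈ 4.11 GB ≈ 3.83 GiB   (matches the 7B log later in this thread)
# 13B: 13e9   × 4.54 / 8 ≈ 7.38 GB ≈ 6.9 GiB    (weights only; KV cache and scratch buffers come on top)
```

The 16GB guidance for 13B models generally refers to total memory headroom, not just the weight file.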


@iliabaranov commented on GitHub (Nov 3, 2023):

That doesn't seem to be the case for me...? My GPU memory doesn't even fill to 100%.


@jmorganca commented on GitHub (Nov 4, 2023):

Hi folks, this should be fixed in 0.1.8. Please re-open if you still see the issue!


@iliabaranov commented on GitHub (Nov 5, 2023):

@jmorganca still an issue unfortunately.
Updated to 0.1.8, pulled the latest mistral, exact same issue.
GPU memory is still not full, so I'm not sure that's the cause.


@BruceMacD commented on GitHub (Nov 10, 2023):

@iliabaranov would you be able to share the output of your `nvidia-smi` command?


@ramon-buzo commented on GitHub (Nov 27, 2023):

Hello guys, I too am experiencing the same issue. I am running Ollama 0.1.11 on Ubuntu 22.04 with an Intel Core i5-8600K, 32GB of RAM, and an NVIDIA GeForce GTX 1060 GPU with 6GB of VRAM. The problem has arisen multiple times while I was developing a simple web UI for Ollama via the API, testing various models (Llama2 7b, Mistral 7b, etc.) to evaluate their behavior on the same questions. To my surprise, I began receiving responses full of ######, and I cannot confirm a memory issue. This is the output of the `nvidia-smi` command:

![nvdia-output](https://github.com/jmorganca/ollama/assets/152201837/9ca9b955-5b90-46e5-9722-8840563ebef1)
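
For anyone reproducing this over the API, a minimal call against the default endpoint (the same `/api/generate` route visible in the server logs elsewhere in this thread; the model name is an example):

```
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "hi"
}'
```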


@phalexo commented on GitHub (Dec 4, 2023):

I see very similar behavior running on GPUs; VRAM usage is less than 50%. Besides the ####, I see the page scrolling pretty fast. This is the response to the first query. If I enter another query, it dies completely, complains about cuBLAS, and suggests that VRAM is low.

I installed Go and rebuilt everything, and it's still the same problem.

The same model appears to work on the host, but it is slow.
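
For anyone retracing that rebuild, the from-source build at the time went roughly like this (a sketch, assuming a checkout of the repository and working Go and CUDA toolchains):

```
git clone https://github.com/jmorganca/ollama
cd ollama
go generate ./...   # builds the bundled llama.cpp runners
go build .
```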


@lnxdevbr commented on GitHub (Jan 3, 2024):

I installed Ubuntu 23.10 with the NVIDIA drivers included. On my i7-4790 with 8GB RAM and an NVIDIA 1060 6GB card, version 0.1.8 worked with GPU acceleration, but version 0.1.17 didn't. On my i5-3470 with 16GB RAM and an NVIDIA 3060 12GB, no version of Ollama worked.


@lnxdevbr commented on GitHub (Jan 4, 2024):

After updating my entire system, I managed to get the latest version of Ollama working correctly. Here are my specifications:

```
arman@AliceAI:~$ uname -a
Linux AliceAI 6.5.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 14:59:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
arman@AliceAI:~$ nvidia-smi
Thu Jan 4 17:06:15 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:01:00.0 On | N/A |
| 0% 30C P8 16W / 170W | 234MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1761 G /usr/lib/xorg/Xorg 119MiB |
| 0 N/A N/A 1990 G /usr/bin/gnome-shell 63MiB |
| 0 N/A N/A 2523 G /usr/bin/nautilus 24MiB |
| 0 N/A N/A 3157 G /usr/bin/gnome-text-editor 16MiB |
+---------------------------------------------------------------------------------------+

arman@AliceAI:~$ ollama serve
2024/01/04 16:55:58 images.go:737: total blobs: 0
2024/01/04 16:55:58 images.go:744: total unused blobs removed: 0
2024/01/04 16:55:58 routes.go:895: Listening on 127.0.0.1:11434 (version 0.1.17)
[GIN] 2024/01/04 - 16:56:14 | 200 | 588.864µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/01/04 - 16:56:14 | 200 | 4.852801ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2024/01/04 - 17:00:23 | 200 | 18.509µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/01/04 - 17:00:59 | 404 | 159.572µs | 127.0.0.1 | HEAD "/api/blobs/sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833"
[GIN] 2024/01/04 - 17:01:24 | 201 | 24.345565906s | 127.0.0.1 | POST "/api/blobs/sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833"
2024/01/04 17:01:24 images.go:370: [model] - @sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833
[GIN] 2024/01/04 - 17:01:48 | 200 | 24.396759785s | 127.0.0.1 | POST "/api/create"
[GIN] 2024/01/04 - 17:01:59 | 200 | 24.651µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/01/04 - 17:01:59 | 200 | 284.445µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/01/04 - 17:01:59 | 200 | 196.575µs | 127.0.0.1 | POST "/api/show"
2024/01/04 17:02:02 llama.go:300: 11807 MB VRAM available, loading up to 72 GPU layers
2024/01/04 17:02:02 llama.go:436: starting llama runner
2024/01/04 17:02:02 llama.go:494: waiting for llama runner to start responding
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
{"timestamp":1704355325,"level":"INFO","function":"main","line":2667,"message":"build info","build":468,"commit":"a7aee47"}
{"timestamp":1704355325,"level":"INFO","function":"main","line":2670,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":4,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /home/arman/.ollama/models/blobs/sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833 (version GGUF V3 (latest))

llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name = beowolx_codeninja-1.0-openchat-7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 70.43 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: VRAM used: 3847.56 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 156.00 MiB
llama_new_context_with_model: total VRAM used: 4259.57 MiB (model: 3847.56 MiB, context: 412.00 MiB)

arman@AliceAI:~$ ollama create ninja -f Modelfile
transferring model data
creating model layer
using already created layer sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833
writing layer sha256:d61e6266c8740ea80cefacf08e4be7d5ae1d6591e76955ca307ad93d0cc036a6
writing manifest
success
arman@AliceAI:~$ ollama run ninja
>>> create python code for web page scrap using bs4

import requests
from bs4 import BeautifulSoup

def scrape_webpage(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

soup = scrape_webpage('https://www.example.com/')
print(soup.prettify())
```


@varonroy commented on GitHub (Jan 6, 2024):

I am having the same issue. Here are my specs:

| Spec | Value |
|---|---|
| OS | Ubuntu 22.04.3 LTS x86_64 |
| Kernel | 5.15.0-91-generic |
| CPU | Intel Xeon Gold 6330 (112) @ 3.100GHz |
| GPU (2x) | NVIDIA A100 PCIe 40GB |

CUDA installation:

| Spec | Value |
|---|---|
| CUDA | 12.3 |
| Driver | 545.23.08 |

nvcc version:

```
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
```

ollama version:

```
$ ollama --version
ollama version is 0.1.18
```

And this doesn't seem to be a memory issue, as the memory is barely used:

![2024-01-06-231724_2008x1862_scrot](https://github.com/jmorganca/ollama/assets/19289056/c58f8f60-c6c2-4d55-91c3-306ced0d9756)


@lnxdevbr commented on GitHub (Jan 6, 2024):

Try Ubuntu 23.10.1, and update ALL packages before installing Ollama; that worked for me. The new Ollama version, 0.1.18, has arrived. Regards.
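
A sketch of that sequence, assuming the standard install script:

```
sudo apt update && sudo apt full-upgrade    # bring the whole system current first
curl https://ollama.ai/install.sh | sh      # then (re)install the latest Ollama
```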


@lnxdevbr commented on GitHub (Jan 6, 2024):

Look at the kernel version and the NVIDIA / CUDA driver versions I posted; maybe installing those on your system will work.


@simplesisu commented on GitHub (Jan 8, 2024):

I just added a second RTX 3060 12GB, restarted the PC and Docker on UnRAID, and loaded mixtral just for fun, but the Docker container crashed and I was unable to restart it. I deleted it along with the image and setup and re-installed it. Now, when loading even tinyllama, it just outputs infinite ######. What happened and how can it be fixed? I reinstalled twice with the same result.

```
llama_model_loader: - tensor  192:            blk.9.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  193:            blk.9.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  194:              blk.9.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  195:            blk.9.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  196:              blk.9.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  197:         blk.9.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  198:              blk.9.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  199:              blk.9.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  200:               output_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = TinyLlama
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,61249]   = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type q4_0:  155 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1.10 B
llm_load_print_meta: model size       = 606.53 MiB (4.63 BPW) 
llm_load_print_meta: general.name     = TinyLlama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.08 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   35.23 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: VRAM used: 571.37 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 44.00 MB
llama_new_context_with_model: KV self size  =   44.00 MiB, K (f16):   22.00 MiB, V (f16):   22.00 MiB
llama_build_graph: non-view tensors processed: 466/466
llama_new_context_with_model: compute buffer total size = 147.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 144.00 MiB
llama_new_context_with_model: total VRAM used: 759.38 MiB (model: 571.37 MiB, context: 188.00 MiB)
2024/01/08 13:53:37 ext_server_common.go:151: Starting internal llama main loop
[GIN] 2024/01/08 - 13:53:37 | 200 |  7.579335261s |       127.0.0.1 | POST     "/api/generate"
2024/01/08 13:53:55 ext_server_common.go:165: loaded 0 images
```

```
NAME                    ID              SIZE    MODIFIED
tinyllama:latest        2644915ede35    637 MB  4 minutes ago
root@3d05f7684e44:/# ollama tinyllama:latest
Error: unknown command "tinyllama:latest" for "ollama"
root@3d05f7684e44:/# ollama run  tinyllama:latest
>>> tell me a joke about futurama
################################ #####################################################################################################################################################################################################################################################################################################################################################################################################################################################################^Z
[1]+  Stopped                 ollama run tinyllama:latest
```

```
:~# nvidia-smi
Mon Jan  8 14:59:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:0A:00.0 Off |                  N/A |
| 55%   51C    P2              68W / 170W |    857MiB / 12288MiB |     49%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        On  | 00000000:4A:00.0 Off |                  N/A |
| 60%   55C    P2              52W / 170W |    665MiB / 12288MiB |     20%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    120413      C   /bin/ollama                                 850MiB |
|    1   N/A  N/A    120413      C   /bin/ollama                                 658MiB |
+---------------------------------------------------------------------------------------+
```

`ollama version is 0.1.18`


@simplesisu commented on GitHub (Jan 8, 2024):

UPDATE:

- I removed the second GPU from Ollama's view by setting NVIDIA_VISIBLE_DEVICES=1 (the excluded device 0 is a brand-new GPU, btw, and works in dual-GPU mode with other AI frameworks such as oobabooga etc.)

Now it works again! So the question is: why the # output when a second GPU is present?
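
A sketch of that workaround for a Docker setup, assuming the official image (`NVIDIA_VISIBLE_DEVICES` is honored by the NVIDIA container runtime; the device index here matches the comment above):

```
docker run -d --gpus=all -e NVIDIA_VISIBLE_DEVICES=1 \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```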


@phalexo commented on GitHub (Jan 8, 2024):

Check the prompt format for this model. I think I've seen this when I failed to use the correct prompt format.
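
For illustration, a hypothetical Modelfile that pins the template explicitly; the template text is an assumption based on TinyLlama's published chat format, not taken from this thread:

```
FROM tinyllama:latest
# assumed Zephyr-style template used by TinyLlama-Chat
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""
```

Then `ollama create tinyllama-fixed -f Modelfile` and run the new tag.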

On Mon, Jan 8, 2024, 9:01 AM simplesisu @.***> wrote:

I just added a second RTX 3060 12GB and restarted pc & docker on UnRAID
and loaded mixtral just for fun but the docker crashed and I was unable to
restart it. Deleted it with image and setup and re-installed it. Now when
loading even tinyllama it just outputs infinite ######...What happened and
how can it be fixed?...I reinstalled 2 times with the same result

llama_model_loader: - tensor 192: blk.9.ffn_down.weight q4_0 [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 193: blk.9.ffn_gate.weight q4_0 [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 194: blk.9.ffn_up.weight q4_0 [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 195: blk.9.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 196: blk.9.attn_k.weight q4_0 [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 197: blk.9.attn_output.weight q4_0 [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 198: blk.9.attn_q.weight q4_0 [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 199: blk.9.attn_v.weight q4_0 [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 200: output_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = TinyLlama
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q4_0: 155 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 606.53 MiB (4.63 BPW)
llm_load_print_meta: general.name = TinyLlama
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 '
'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 2 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.08 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 35.23 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: VRAM used: 571.37 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 44.00 MB
llama_new_context_with_model: KV self size = 44.00 MiB, K (f16): 22.00 MiB, V (f16): 22.00 MiB
llama_build_graph: non-view tensors processed: 466/466
llama_new_context_with_model: compute buffer total size = 147.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 144.00 MiB
llama_new_context_with_model: total VRAM used: 759.38 MiB (model: 571.37 MiB, context: 188.00 MiB)
2024/01/08 13:53:37 ext_server_common.go:151: Starting internal llama main loop
[GIN] 2024/01/08 - 13:53:37 | 200 | 7.579335261s | 127.0.0.1 | POST "/api/generate"
2024/01/08 13:53:55 ext_server_common.go:165: loaded 0 images

NAME              ID            SIZE    MODIFIED
tinyllama:latest  2644915ede35  637 MB  4 minutes ago
***@***.***:/# ollama tinyllama:latest
Error: unknown command "tinyllama:latest" for "ollama"
***@***.***:/# ollama run tinyllama:latest

>>> tell me a joke about futurama
################################ #####################################################################################################################################################################################################################################################################################################################################################################################################################################################################^Z
[1]+ Stopped ollama run tinyllama:latest

:~# nvidia-smi
Mon Jan 8 14:59:51 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:0A:00.0 Off | N/A |
| 55% 51C P2 68W / 170W | 857MiB / 12288MiB | 49% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3060 On | 00000000:4A:00.0 Off | N/A |
| 60% 55C P2 52W / 170W | 665MiB / 12288MiB | 20% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 120413 C /bin/ollama 850MiB |
| 1 N/A N/A 120413 C /bin/ollama 658MiB |
+---------------------------------------------------------------------------------------+

ollama version is 0.1.18



<!-- gh-comment-id:1881094839 --> @phalexo commented on GitHub (Jan 8, 2024): Check the prompt format for this model. I think I've seen this when I failed to use the correct prompt format. On Mon, Jan 8, 2024, 9:01 AM simplesisu ***@***.***> wrote: > I just added a second RTX 3060 12GB and restarted pc & docker on UnRAID > and loaded mixtral just for fun but the docker crashed and I was unable to > restart it. Deleted it with image and setup and re-installed it. Now when > loading even tinyllama it just outputs infinite ######...What happened and > how can it be fixed?...I reinstalled 2 times with the same result > > llama_model_loader: - tensor 192: blk.9.ffn_down.weight q4_0 [ 5632, 2048, 1, 1 ] > llama_model_loader: - tensor 193: blk.9.ffn_gate.weight q4_0 [ 2048, 5632, 1, 1 ] > llama_model_loader: - tensor 194: blk.9.ffn_up.weight q4_0 [ 2048, 5632, 1, 1 ] > llama_model_loader: - tensor 195: blk.9.ffn_norm.weight f32 [ 2048, 1, 1, 1 ] > llama_model_loader: - tensor 196: blk.9.attn_k.weight q4_0 [ 2048, 256, 1, 1 ] > llama_model_loader: - tensor 197: blk.9.attn_output.weight q4_0 [ 2048, 2048, 1, 1 ] > llama_model_loader: - tensor 198: blk.9.attn_q.weight q4_0 [ 2048, 2048, 1, 1 ] > llama_model_loader: - tensor 199: blk.9.attn_v.weight q4_0 [ 2048, 256, 1, 1 ] > llama_model_loader: - tensor 200: output_norm.weight f32 [ 2048, 1, 1, 1 ] > llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. > llama_model_loader: - kv 0: general.architecture str = llama > llama_model_loader: - kv 1: general.name str = TinyLlama > llama_model_loader: - kv 2: llama.context_length u32 = 2048 > llama_model_loader: - kv 3: llama.embedding_length u32 = 2048 > llama_model_loader: - kv 4: llama.block_count u32 = 22 > llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632 > llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64 > llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 > llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4 > llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 > llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000 > llama_model_loader: - kv 11: general.file_type u32 = 2 > llama_model_loader: - kv 12: tokenizer.ggml.model str = llama > llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<... > llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... > llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... > llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n... > llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1 > llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2 > llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0 > llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2 > llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m... > llama_model_loader: - kv 22: general.quantization_version u32 = 2 > llama_model_loader: - type f32: 45 tensors > llama_model_loader: - type q4_0: 155 tensors > llama_model_loader: - type q6_K: 1 tensors > llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
> llm_load_print_meta: format = GGUF V3 (latest) > llm_load_print_meta: arch = llama > llm_load_print_meta: vocab type = SPM > llm_load_print_meta: n_vocab = 32000 > llm_load_print_meta: n_merges = 0 > llm_load_print_meta: n_ctx_train = 2048 > llm_load_print_meta: n_embd = 2048 > llm_load_print_meta: n_head = 32 > llm_load_print_meta: n_head_kv = 4 > llm_load_print_meta: n_layer = 22 > llm_load_print_meta: n_rot = 64 > llm_load_print_meta: n_gqa = 8 > llm_load_print_meta: f_norm_eps = 0.0e+00 > llm_load_print_meta: f_norm_rms_eps = 1.0e-05 > llm_load_print_meta: f_clamp_kqv = 0.0e+00 > llm_load_print_meta: f_max_alibi_bias = 0.0e+00 > llm_load_print_meta: n_ff = 5632 > llm_load_print_meta: n_expert = 0 > llm_load_print_meta: n_expert_used = 0 > llm_load_print_meta: rope scaling = linear > llm_load_print_meta: freq_base_train = 10000.0 > llm_load_print_meta: freq_scale_train = 1 > llm_load_print_meta: n_yarn_orig_ctx = 2048 > llm_load_print_meta: rope_finetuned = unknown > llm_load_print_meta: model type = 1B > llm_load_print_meta: model ftype = Q4_0 > llm_load_print_meta: model params = 1.10 B > llm_load_print_meta: model size = 606.53 MiB (4.63 BPW) > llm_load_print_meta: general.name = TinyLlama > llm_load_print_meta: BOS token = 1 '<s>' > llm_load_print_meta: EOS token = 2 '</s>' > llm_load_print_meta: UNK token = 0 '<unk>' > llm_load_print_meta: PAD token = 2 '</s>' > llm_load_print_meta: LF token = 13 '<0x0A>' > llm_load_tensors: ggml ctx size = 0.08 MiB > llm_load_tensors: using CUDA for GPU acceleration > llm_load_tensors: mem required = 35.23 MiB > llm_load_tensors: offloading 22 repeating layers to GPU > llm_load_tensors: offloading non-repeating layers to GPU > llm_load_tensors: offloaded 23/23 layers to GPU > llm_load_tensors: VRAM used: 571.37 MiB > ....................................................................................... 
> llama_new_context_with_model: n_ctx = 2048 > llama_new_context_with_model: freq_base = 10000.0 > llama_new_context_with_model: freq_scale = 1 > llama_kv_cache_init: VRAM kv self = 44.00 MB > llama_new_context_with_model: KV self size = 44.00 MiB, K (f16): 22.00 MiB, V (f16): 22.00 MiB > llama_build_graph: non-view tensors processed: 466/466 > llama_new_context_with_model: compute buffer total size = 147.19 MiB > llama_new_context_with_model: VRAM scratch buffer: 144.00 MiB > llama_new_context_with_model: total VRAM used: 759.38 MiB (model: 571.37 MiB, context: 188.00 MiB) > 2024/01/08 13:53:37 ext_server_common.go:151: Starting internal llama main loop > [GIN] 2024/01/08 - 13:53:37 | 200 | 7.579335261s | 127.0.0.1 | POST "/api/generate" > 2024/01/08 13:53:55 ext_server_common.go:165: loaded 0 images > > NAME ID SIZE MODIFIED > tinyllama:latest 2644915ede35 637 MB 4 minutes ago > ***@***.***:/# ollama tinyllama:latest > Error: unknown command "tinyllama:latest" for "ollama" > ***@***.***:/# ollama run tinyllama:latest > >>> tell me a joke about futurama > ################################ #####################################################################################################################################################################################################################################################################################################################################################################################################################################################################^Z > [1]+ Stopped ollama run tinyllama:latest > > :~# nvidia-smi > Mon Jan 8 14:59:51 2024 > +---------------------------------------------------------------------------------------+ > | NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 | > |-----------------------------------------+----------------------+----------------------+ > | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | > | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | > | | | MIG M. | > |=========================================+======================+======================| > | 0 NVIDIA GeForce RTX 3060 On | 00000000:0A:00.0 Off | N/A | > | 55% 51C P2 68W / 170W | 857MiB / 12288MiB | 49% Default | > | | | N/A | > +-----------------------------------------+----------------------+----------------------+ > | 1 NVIDIA GeForce RTX 3060 On | 00000000:4A:00.0 Off | N/A | > | 60% 55C P2 52W / 170W | 665MiB / 12288MiB | 20% Default | > | | | N/A | > +-----------------------------------------+----------------------+----------------------+ > > +---------------------------------------------------------------------------------------+ > | Processes: | > | GPU GI CI PID Type Process name GPU Memory | > | ID ID Usage | > |=======================================================================================| > | 0 N/A N/A 120413 C /bin/ollama 850MiB | > | 1 N/A N/A 120413 C /bin/ollama 658MiB | > +---------------------------------------------------------------------------------------+ > > ollama version is 0.1.18 > > — > Reply to this email directly, view it on GitHub > <https://github.com/jmorganca/ollama/issues/969#issuecomment-1881068878>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ABDD3ZLEGEZJK6OTHIFYKELYNP33HAVCNFSM6AAAAAA62ONCIKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBRGA3DQOBXHA> > . > You are receiving this because you commented.Message ID: > ***@***.***> >
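One way to rule out the prompt-format mismatch @phalexo mentions is to print the template the model was packaged with. A minimal sketch, assuming the `tinyllama:latest` tag from the log above and the standard `ollama show` subcommand:

```
# print the Modelfile, including the TEMPLATE directive the model ships with
ollama show --modelfile tinyllama:latest
```

If the TEMPLATE block doesn't match the format the model was trained on, garbled or runaway output is a plausible symptom, though the driver reports below point at a CUDA-level cause in this case.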
Author
Owner

@simplesisu commented on GitHub (Jan 8, 2024):

For which model?

<!-- gh-comment-id:1881191044 --> @simplesisu commented on GitHub (Jan 8, 2024): For which model?
Author
Owner

@IliyanGochev commented on GitHub (Jan 9, 2024):

Had the same problem: Ubuntu 22.04 LTS, 2x RTX 3090. No matter which model I tried (Phi, Mistral), I'd get either infinite gibberish words or infinite # signs.

I tried the suggestion of upgrading to the latest 23.x release of Ubuntu, but that did not help.
It only broke my Nvidia driver (545 / CUDA 12.3 at the time), and Ollama / llama.cpp ran in CPU-only mode.

Then I downgraded the driver to 535 / CUDA 12.2, and now I'm able to run Phi, Mistral, and even Mixtral without a problem on the GPUs.

<!-- gh-comment-id:1883149099 --> @IliyanGochev commented on GitHub (Jan 9, 2024): Had the same problem, Ubuntu 22.04 LTS, 2xRTX3090, no matter the model tired (Phi, Mistral) I'd either get infinite gibberish words or infinite # signs. I've tried the suggestion of upgrading to the latest 23.x of Ubuntu, but that did not help. It only broke my Nvidia driver (545 / CUDA 12.3 at the time) and Ollama / Llama.cpp ran on CPU-only mode. Then I've downgraded the driver to 535 / CUDA 12.2 and now I'm able to run both Phi and Mistral, and even Mixtral without a problem on the GPUs.
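For reference, the downgrade described above can be done through apt on Ubuntu 22.04. A rough sketch, assuming the stock `nvidia-driver-545` and `nvidia-driver-535` package names (exact packages vary by repository):

```
# swap the 545-series driver for the 535 series, then reboot
sudo apt-get remove --purge nvidia-driver-545
sudo apt-get install nvidia-driver-535
sudo reboot

# after reboot, nvidia-smi should report a 535.x driver / CUDA 12.2
nvidia-smi
```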
Author
Owner

@simplesisu commented on GitHub (Jan 12, 2024):

> Had the same problem: Ubuntu 22.04 LTS, 2x RTX 3090. No matter which model I tried (Phi, Mistral), I'd get either infinite gibberish words or infinite # signs.
>
> I tried the suggestion of upgrading to the latest 23.x release of Ubuntu, but that did not help. It only broke my Nvidia driver (545 / CUDA 12.3 at the time), and Ollama / llama.cpp ran in CPU-only mode.
>
> Then I downgraded the driver to 535 / CUDA 12.2, and now I'm able to run Phi, Mistral, and even Mixtral without a problem on the GPUs.

Glad it worked for you! Which version of 535 (v535.146.02, v535.129.03, or other)?

<!-- gh-comment-id:1889530177 --> @simplesisu commented on GitHub (Jan 12, 2024): > Had the same problem, Ubuntu 22.04 LTS, 2xRTX3090, no matter the model tired (Phi, Mistral) I'd either get infinite gibberish words or infinite # signs. > > I've tried the suggestion of upgrading to the latest 23.x of Ubuntu, but that did not help. It only broke my Nvidia driver (545 / CUDA 12.3 at the time) and Ollama / Llama.cpp ran on CPU-only mode. > > Then I've downgraded the driver to 535 / CUDA 12.2 and now I'm able to run both Phi and Mistral, and even Mixtral without a problem on the GPUs. Glad it worked for you! which version of 535..**.( v535.146.02, v535.129.03 or other?)**
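The installed build can be read straight from the driver with standard nvidia-smi query flags:

```
# prints the driver build, e.g. 535.129.03, one line per GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```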
Author
Owner

@igorschlum commented on GitHub (Jan 12, 2024):

Could you try version 0.1.20? It may solve the issue.

<!-- gh-comment-id:1889654619 --> @igorschlum commented on GitHub (Jan 12, 2024): Could you try with version 0.1.20? It could solve the issue
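On Linux, re-running the install script upgrades an existing installation in place. A minimal sketch, assuming the install URL documented at the time:

```
# upgrade ollama in place, then confirm the new version
curl -fsSL https://ollama.ai/install.sh | sh
ollama -v
```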
Author
Owner

@jmorganca commented on GitHub (Apr 17, 2024):

Hi there, this should be fixed now. If not, please let me know.

<!-- gh-comment-id:2060215179 --> @jmorganca commented on GitHub (Apr 17, 2024): Hi there, this should be fixed now. If not please let me know
Reference: github-starred/ollama#472