[GH-ISSUE #969] ###### problem #472

Closed
opened 2026-04-12 10:09:04 -05:00 by GiteaMirror · 30 comments

Originally created by @k3341095 on GitHub (Nov 2, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/969

Originally assigned to: @BruceMacD on GitHub.

![image](https://github.com/jmorganca/ollama/assets/17330375/7148c0f6-47b4-4fa4-b219-436e12776f79)

The command to install Docker and run the 13b model worked fine. However, after starting the model and typing "hi", the output is nothing but #### repeating forever.
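
For reference, the Docker setup presumably followed the documented commands; a minimal sketch, assuming the official `ollama/ollama` image and the `llama2:13b` tag:

```
# start the Ollama container with GPU access
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# pull and run the 13b model inside the container
docker exec -it ollama ollama run llama2:13b
```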

GiteaMirror added the bug label 2026-04-12 10:09:04 -05:00

@BruceMacD commented on GitHub (Nov 2, 2023):

Hi @k3341095, which model is this and what are the resources allocated to the container?

<!-- gh-comment-id:1790886699 --> @BruceMacD commented on GitHub (Nov 2, 2023): Hi @k3341095, which model is this and what are the resources allocated to the container?
Author
Owner

@iliabaranov commented on GitHub (Nov 2, 2023):

Hello, same issue here. I am running Mistral natively, just with `ollama run mistral`. It was working fine for a while, and now it's doing the above.

Even if I `kill -9` it and restart, it's still the same issue. It's using 4919MiB / 6144MiB of GPU memory on my card and not much CPU, so resource-wise it looks fine...?


@mchiang0610 commented on GitHub (Nov 2, 2023):

@iliabaranov and @k3341095 Sorry about this. May I ask which OS and system specs you are seeing this problem on?

Just in case the model had issues, would it be possible to ask for the model to be pulled again? (It'll calculate the diff if there are any differences.)

`ollama pull mistral`

Then run it again.

I'm unable to reproduce on a MacBook Pro 16" M1 16GB.

![screenshot 000238@2x](https://github.com/jmorganca/ollama/assets/3325447/90b96075-e949-4d5a-81d2-93a51c8b09da)
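
In other words, the recovery sequence is just (a minimal sketch, using the model name from this thread):

```
ollama pull mistral   # re-downloads only the layers whose digests differ
ollama run mistral
```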


@mchiang0610 commented on GitHub (Nov 2, 2023):

My ollama version is 0.1.7 (checked with `ollama -v`).


@iliabaranov commented on GitHub (Nov 2, 2023):

Same, I'm at 0.1.7.

Terminal output:

```
iliabara@Metis:~$ ollama -v
ollama version 0.1.7
iliabara@Metis:~$ ollama pull mistral
pulling manifest
pulling 6ae280299950... 100% |██████████████████| (4.1/4.1 GB, 25 TB/s)
pulling 22e1b2e8dc2f... 100% |████████████████████| (43/43 B, 641 kB/s)
pulling e35ab70a78c7... 100% |████████████████████| (90/90 B, 1.4 MB/s)
pulling 1cb90d66f4d4... 100% |██████████████████| (381/381 B, 5.9 MB/s)
verifying sha256 digest
writing manifest
removing any unused layers
success
iliabara@Metis:~$ ollama run mistral
>>> hello?
H#######################################################################################################^C
```

@mchiang0610 commented on GitHub (Nov 2, 2023):

@iliabaranov thanks. Could I ask what terminal you are using, and your OS and specs? Trying to narrow this down and troubleshoot.


@visualinventor commented on GitHub (Nov 2, 2023):

I am getting a similar garbled response, but on the second question I ask as a follow-up. I'm using Llama2 and Ollama 0.1.7, on macOS Sonoma on an M1 MacBook.


@iliabaranov commented on GitHub (Nov 2, 2023):

Sure thing, it's the standard terminal on Ubuntu 20.04; nothing particularly special about the setup or OS.


@igorschlum commented on GitHub (Nov 3, 2023):

I use Ollama 0.1.7 on a MacBook M2 with 32GB RAM, macOS 13.5.2 (22G91), and cannot reproduce the issue:

```
(base) igor@macIgor ~ % ollama pull mistral
pulling manifest
pulling 6ae280299950... 100% |████████████████████████████████| (4.1/4.1 GB, 14 TB/s)
pulling 22e1b2e8dc2f... 100% |████████████████████████████████████| (43/43 B, 17 B/s)
pulling e35ab70a78c7... 100% |██████████████████████████████████| (90/90 B, 1.2 MB/s)
pulling 1cb90d66f4d4... 100% |█████████████████████████████████| (381/381 B, 163 B/s)
verifying sha256 digest
writing manifest
removing any unused layers
success
(base) igor@macIgor ~ % ollama run mistral
>>> hello?
Hello! How can I help you today?

>>> can you tell me if the moon is getting bigger today?
I am not aware of any significant events that would cause the Moon to physically change
in size today. However, its appearance may appear to change due to various factors such
as its position in relation to the Earth and Sun or atmospheric conditions. If you have
any other questions, feel free to ask!

>>> hello?
Hello again! How can I assist you further today?

>>> Send a message (/? for help)
```


@k3341095 commented on GitHub (Nov 3, 2023):

Ollama latest / Ubuntu 22.04 / NVIDIA 1080 / 13b model / B450 board, Ryzen 5600G, 32GB memory


@k3341095 commented on GitHub (Nov 3, 2023):

I've tested it again now, and it seems to be due to low memory: the 13B model requires at least 16 gigs of memory, and my GPU memory is 10GB. I thought the 16 gigs meant system memory... and I thought the 13b model size was less than 8 gigs. My apologies. I think you can close this; the mistral model works fine.
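
For a rough sanity check on sizes (a back-of-the-envelope sketch, using the bits-per-weight figure reported for a 7B Q4_0 model further down in this thread):

```
# weight bytes ≈ params × bits_per_weight / 8
# 7B:  7.24e9 × 4.54 / 8 ≈ 4.11 GB ≈ 3.83 GiB   (matches the 7B log later in this thread)
# 13B: 13e9   × 4.54 / 8 ≈ 7.38 GB ≈ 6.9 GiB    (weights only; KV cache and scratch buffers come on top)
```

The 16GB guidance for 13B models generally refers to total memory headroom, not just the weight file.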


@iliabaranov commented on GitHub (Nov 3, 2023):

That doesn't seem to be the case for me...? My GPU memory doesn't even fill to 100%.


@jmorganca commented on GitHub (Nov 4, 2023):

Hi folks, this should be fixed in 0.1.8. Please re-open if you still see the issue!


@iliabaranov commented on GitHub (Nov 5, 2023):

@jmorganca still an issue unfortunately.
Updated to 0.1.8, pulled the latest mistral, exact same issue.
GPU memory is still not full, so I'm not sure that's the cause.


@BruceMacD commented on GitHub (Nov 10, 2023):

@iliabaranov would you be able to share the output of your `nvidia-smi` command?


@ramon-buzo commented on GitHub (Nov 27, 2023):

Hello guys, I too am experiencing the same issue. I am running Ollama 0.1.11 on Ubuntu 22.04 with an Intel Core i5-8600K, 32GB of RAM, and an NVIDIA GeForce GTX 1060 GPU with 6GB of VRAM. The problem has arisen multiple times while I was developing a simple web UI for Ollama via the API, testing various models (Llama2 7b, Mistral 7b, etc.) to evaluate their behavior on the same questions. To my surprise, I began receiving responses full of ######, and I cannot confirm a memory issue. This is the output of the `nvidia-smi` command:

![nvdia-output](https://github.com/jmorganca/ollama/assets/152201837/9ca9b955-5b90-46e5-9722-8840563ebef1)
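
For anyone reproducing this over the API, a minimal call against the default endpoint (the same `/api/generate` route visible in the server logs elsewhere in this thread; the model name is an example):

```
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "hi"
}'
```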


@phalexo commented on GitHub (Dec 4, 2023):

I see very similar behavior running on GPUs; VRAM usage is less than 50%. Besides the ####, I see the page scrolling pretty fast. This is the response to the first query. If I enter another query, it dies completely, complains about cuBLAS, and suggests that VRAM is low.

I installed Go and rebuilt everything, and it's still the same problem.

The same model appears to work on the host, but it is slow.
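
For anyone retracing that rebuild, the from-source build at the time went roughly like this (a sketch, assuming a checkout of the repository and working Go and CUDA toolchains):

```
git clone https://github.com/jmorganca/ollama
cd ollama
go generate ./...   # builds the bundled llama.cpp runners
go build .
```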


@lnxdevbr commented on GitHub (Jan 3, 2024):

I installed Ubuntu 23.10 with the NVIDIA drivers included. On my i7-4790 with 8GB RAM and an NVIDIA 1060 6GB card, version 0.1.8 worked with GPU acceleration, but version 0.1.17 didn't. On my i5-3470 with 16GB RAM and an NVIDIA 3060 12GB, no version of Ollama worked.


@lnxdevbr commented on GitHub (Jan 4, 2024):

After updating my entire system, I managed to get the latest version of Ollama working correctly. Here are my specifications:

```
arman@AliceAI:~$ uname -a
Linux AliceAI 6.5.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 14:59:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
arman@AliceAI:~$ nvidia-smi
Thu Jan 4 17:06:15 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:01:00.0 On | N/A |
| 0% 30C P8 16W / 170W | 234MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1761 G /usr/lib/xorg/Xorg 119MiB |
| 0 N/A N/A 1990 G /usr/bin/gnome-shell 63MiB |
| 0 N/A N/A 2523 G /usr/bin/nautilus 24MiB |
| 0 N/A N/A 3157 G /usr/bin/gnome-text-editor 16MiB |
+---------------------------------------------------------------------------------------+

arman@AliceAI:~$ ollama serve
2024/01/04 16:55:58 images.go:737: total blobs: 0
2024/01/04 16:55:58 images.go:744: total unused blobs removed: 0
2024/01/04 16:55:58 routes.go:895: Listening on 127.0.0.1:11434 (version 0.1.17)
[GIN] 2024/01/04 - 16:56:14 | 200 | 588.864µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/01/04 - 16:56:14 | 200 | 4.852801ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2024/01/04 - 17:00:23 | 200 | 18.509µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/01/04 - 17:00:59 | 404 | 159.572µs | 127.0.0.1 | HEAD "/api/blobs/sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833"
[GIN] 2024/01/04 - 17:01:24 | 201 | 24.345565906s | 127.0.0.1 | POST "/api/blobs/sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833"
2024/01/04 17:01:24 images.go:370: [model] - @sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833
[GIN] 2024/01/04 - 17:01:48 | 200 | 24.396759785s | 127.0.0.1 | POST "/api/create"
[GIN] 2024/01/04 - 17:01:59 | 200 | 24.651µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/01/04 - 17:01:59 | 200 | 284.445µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/01/04 - 17:01:59 | 200 | 196.575µs | 127.0.0.1 | POST "/api/show"
2024/01/04 17:02:02 llama.go:300: 11807 MB VRAM available, loading up to 72 GPU layers
2024/01/04 17:02:02 llama.go:436: starting llama runner
2024/01/04 17:02:02 llama.go:494: waiting for llama runner to start responding
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
{"timestamp":1704355325,"level":"INFO","function":"main","line":2667,"message":"build info","build":468,"commit":"a7aee47"}
{"timestamp":1704355325,"level":"INFO","function":"main","line":2670,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":4,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /home/arman/.ollama/models/blobs/sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833 (version GGUF V3 (latest))

llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name = beowolx_codeninja-1.0-openchat-7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 70.43 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: VRAM used: 3847.56 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 156.00 MiB
llama_new_context_with_model: total VRAM used: 4259.57 MiB (model: 3847.56 MiB, context: 412.00 MiB)

arman@AliceAI:~$ ollama create ninja -f Modelfile
transferring model data
creating model layer
using already created layer sha256:3ef24972116b3f4b0da187e514c3f29ae653f83c00fd6886d70d7b74694b3833
writing layer sha256:d61e6266c8740ea80cefacf08e4be7d5ae1d6591e76955ca307ad93d0cc036a6
writing manifest
success
arman@AliceAI:~$ ollama run ninja
>>> create python code for web page scrap using bs4

import requests
from bs4 import BeautifulSoup

def scrape_webpage(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

soup = scrape_webpage('https://www.example.com/')
print(soup.prettify())
```


@varonroy commented on GitHub (Jan 6, 2024):

I am having the same issue. Here are my specs:

| Spec | Value |
|---|---|
| OS | Ubuntu 22.04.3 LTS x86_64 |
| Kernel | 5.15.0-91-generic |
| CPU | Intel Xeon Gold 6330 (112) @ 3.100GHz |
| GPU (2x) | NVIDIA A100 PCIe 40GB |

CUDA installation:

| Spec | Value |
|---|---|
| CUDA | 12.3 |
| Driver | 545.23.08 |

nvcc version:

```
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
```

ollama version:

```
$ ollama --version
ollama version is 0.1.18
```

And this doesn't seem to be a memory issue, as the memory is barely used:

![2024-01-06-231724_2008x1862_scrot](https://github.com/jmorganca/ollama/assets/19289056/c58f8f60-c6c2-4d55-91c3-306ced0d9756)


@lnxdevbr commented on GitHub (Jan 6, 2024):

Try Ubuntu 23.10.1, and update ALL packages before installing Ollama; that worked for me. The new Ollama version, 0.1.18, has arrived. Regards.
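
A sketch of that sequence, assuming the standard install script:

```
sudo apt update && sudo apt full-upgrade    # bring the whole system current first
curl https://ollama.ai/install.sh | sh      # then (re)install the latest Ollama
```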


@lnxdevbr commented on GitHub (Jan 6, 2024):

Look at the kernel version and the NVIDIA / CUDA driver versions I posted; maybe installing those on your system will work.


@simplesisu commented on GitHub (Jan 8, 2024):

I just added a second RTX 3060 12GB, restarted the PC and Docker on UnRAID, and loaded mixtral just for fun, but the Docker container crashed and I was unable to restart it. I deleted it along with the image and setup and re-installed it. Now, when loading even tinyllama, it just outputs infinite ######. What happened and how can it be fixed? I reinstalled twice with the same result.

```
llama_model_loader: - tensor  192:            blk.9.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  193:            blk.9.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  194:              blk.9.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  195:            blk.9.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  196:              blk.9.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  197:         blk.9.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  198:              blk.9.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  199:              blk.9.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  200:               output_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = TinyLlama
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,61249]   = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type q4_0:  155 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1.10 B
llm_load_print_meta: model size       = 606.53 MiB (4.63 BPW) 
llm_load_print_meta: general.name     = TinyLlama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.08 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   35.23 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: VRAM used: 571.37 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 44.00 MB
llama_new_context_with_model: KV self size  =   44.00 MiB, K (f16):   22.00 MiB, V (f16):   22.00 MiB
llama_build_graph: non-view tensors processed: 466/466
llama_new_context_with_model: compute buffer total size = 147.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 144.00 MiB
llama_new_context_with_model: total VRAM used: 759.38 MiB (model: 571.37 MiB, context: 188.00 MiB)
2024/01/08 13:53:37 ext_server_common.go:151: Starting internal llama main loop
[GIN] 2024/01/08 - 13:53:37 | 200 |  7.579335261s |       127.0.0.1 | POST     "/api/generate"
2024/01/08 13:53:55 ext_server_common.go:165: loaded 0 images
```

```
NAME                    ID              SIZE    MODIFIED
tinyllama:latest        2644915ede35    637 MB  4 minutes ago
root@3d05f7684e44:/# ollama tinyllama:latest
Error: unknown command "tinyllama:latest" for "ollama"
root@3d05f7684e44:/# ollama run  tinyllama:latest
>>> tell me a joke about futurama
################################ #####################################################################################################################################################################################################################################################################################################################################################################################################################################################################^Z
[1]+  Stopped                 ollama run tinyllama:latest
```

```
:~# nvidia-smi
Mon Jan  8 14:59:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:0A:00.0 Off |                  N/A |
| 55%   51C    P2              68W / 170W |    857MiB / 12288MiB |     49%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        On  | 00000000:4A:00.0 Off |                  N/A |
| 60%   55C    P2              52W / 170W |    665MiB / 12288MiB |     20%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    120413      C   /bin/ollama                                 850MiB |
|    1   N/A  N/A    120413      C   /bin/ollama                                 658MiB |
+---------------------------------------------------------------------------------------+
```

`ollama version is 0.1.18`


@simplesisu commented on GitHub (Jan 8, 2024):

UPDATE:

- I removed the second GPU from Ollama's view by setting NVIDIA_VISIBLE_DEVICES=1 (the excluded device 0 is a brand-new GPU, btw, and works in dual-GPU mode with other AI frameworks such as oobabooga etc.)

Now it works again! So the question is: why the # output when a second GPU is present?
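
A sketch of that workaround for a Docker setup, assuming the official image (`NVIDIA_VISIBLE_DEVICES` is honored by the NVIDIA container runtime; the device index here matches the comment above):

```
docker run -d --gpus=all -e NVIDIA_VISIBLE_DEVICES=1 \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```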


@phalexo commented on GitHub (Jan 8, 2024):

Check the prompt format for this model. I think I've seen this when I failed to use the correct prompt format.
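
For illustration, a hypothetical Modelfile that pins the template explicitly; the template text is an assumption based on TinyLlama's published chat format, not taken from this thread:

```
FROM tinyllama:latest
# assumed Zephyr-style template used by TinyLlama-Chat
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""
```

Then `ollama create tinyllama-fixed -f Modelfile` and run the new tag.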

On Mon, Jan 8, 2024, 9:01 AM simplesisu @.***> wrote:

I just added a second RTX 3060 12GB and restarted pc & docker on UnRAID
and loaded mixtral just for fun but the docker crashed and I was unable to
restart it. Deleted it with image and setup and re-installed it. Now when
loading even tinyllama it just outputs infinite ######...What happened and
how can it be fixed?...I reinstalled 2 times with the same result

llama_model_loader: - tensor 192: blk.9.ffn_down.weight q4_0 [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 193: blk.9.ffn_gate.weight q4_0 [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 194: blk.9.ffn_up.weight q4_0 [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 195: blk.9.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 196: blk.9.attn_k.weight q4_0 [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 197: blk.9.attn_output.weight q4_0 [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 198: blk.9.attn_q.weight q4_0 [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 199: blk.9.attn_v.weight q4_0 [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 200: output_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = TinyLlama
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q4_0: 155 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 606.53 MiB (4.63 BPW)
llm_load_print_meta: general.name = TinyLlama
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 '
'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 2 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.08 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 35.23 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: VRAM used: 571.37 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 44.00 MB
llama_new_context_with_model: KV self size = 44.00 MiB, K (f16): 22.00 MiB, V (f16): 22.00 MiB
llama_build_graph: non-view tensors processed: 466/466
llama_new_context_with_model: compute buffer total size = 147.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 144.00 MiB
llama_new_context_with_model: total VRAM used: 759.38 MiB (model: 571.37 MiB, context: 188.00 MiB)
2024/01/08 13:53:37 ext_server_common.go:151: Starting internal llama main loop
[GIN] 2024/01/08 - 13:53:37 | 200 | 7.579335261s | 127.0.0.1 | POST "/api/generate"
2024/01/08 13:53:55 ext_server_common.go:165: loaded 0 images

NAME              ID            SIZE    MODIFIED
tinyllama:latest  2644915ede35  637 MB  4 minutes ago
***@***.***:/# ollama tinyllama:latest
Error: unknown command "tinyllama:latest" for "ollama"
***@***.***:/# ollama run tinyllama:latest

>>> tell me a joke about futurama
################################ #####################################################################################################################################################################################################################################################################################################################################################################################################################################################################^Z
[1]+ Stopped ollama run tinyllama:latest

:~# nvidia-smi
Mon Jan 8 14:59:51 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:0A:00.0 Off | N/A |
| 55% 51C P2 68W / 170W | 857MiB / 12288MiB | 49% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3060 On | 00000000:4A:00.0 Off | N/A |
| 60% 55C P2 52W / 170W | 665MiB / 12288MiB | 20% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 120413 C /bin/ollama 850MiB |
| 1 N/A N/A 120413 C /bin/ollama 658MiB |
+---------------------------------------------------------------------------------------+

ollama version is 0.1.18



<!-- gh-comment-id:1881094839 --> @phalexo commented on GitHub (Jan 8, 2024): Check the prompt format for this model. I think I've seen this when I failed to use the correct prompt format. On Mon, Jan 8, 2024, 9:01 AM simplesisu ***@***.***> wrote: > I just added a second RTX 3060 12GB and restarted pc & docker on UnRAID > and loaded mixtral just for fun but the docker crashed and I was unable to > restart it. Deleted it with image and setup and re-installed it. Now when > loading even tinyllama it just outputs infinite ######...What happened and > how can it be fixed?...I reinstalled 2 times with the same result > > llama_model_loader: - tensor 192: blk.9.ffn_down.weight q4_0 [ 5632, 2048, 1, 1 ] > llama_model_loader: - tensor 193: blk.9.ffn_gate.weight q4_0 [ 2048, 5632, 1, 1 ] > llama_model_loader: - tensor 194: blk.9.ffn_up.weight q4_0 [ 2048, 5632, 1, 1 ] > llama_model_loader: - tensor 195: blk.9.ffn_norm.weight f32 [ 2048, 1, 1, 1 ] > llama_model_loader: - tensor 196: blk.9.attn_k.weight q4_0 [ 2048, 256, 1, 1 ] > llama_model_loader: - tensor 197: blk.9.attn_output.weight q4_0 [ 2048, 2048, 1, 1 ] > llama_model_loader: - tensor 198: blk.9.attn_q.weight q4_0 [ 2048, 2048, 1, 1 ] > llama_model_loader: - tensor 199: blk.9.attn_v.weight q4_0 [ 2048, 256, 1, 1 ] > llama_model_loader: - tensor 200: output_norm.weight f32 [ 2048, 1, 1, 1 ] > llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. > llama_model_loader: - kv 0: general.architecture str = llama > llama_model_loader: - kv 1: general.name str = TinyLlama > llama_model_loader: - kv 2: llama.context_length u32 = 2048 > llama_model_loader: - kv 3: llama.embedding_length u32 = 2048 > llama_model_loader: - kv 4: llama.block_count u32 = 22 > llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632 > llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64 > llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 > llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4 > llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 > llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000 > llama_model_loader: - kv 11: general.file_type u32 = 2 > llama_model_loader: - kv 12: tokenizer.ggml.model str = llama > llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<... > llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... > llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... > llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n... > llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1 > llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2 > llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0 > llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2 > llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m... > llama_model_loader: - kv 22: general.quantization_version u32 = 2 > llama_model_loader: - type f32: 45 tensors > llama_model_loader: - type q4_0: 155 tensors > llama_model_loader: - type q6_K: 1 tensors > llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
> llm_load_print_meta: format = GGUF V3 (latest) > llm_load_print_meta: arch = llama > llm_load_print_meta: vocab type = SPM > llm_load_print_meta: n_vocab = 32000 > llm_load_print_meta: n_merges = 0 > llm_load_print_meta: n_ctx_train = 2048 > llm_load_print_meta: n_embd = 2048 > llm_load_print_meta: n_head = 32 > llm_load_print_meta: n_head_kv = 4 > llm_load_print_meta: n_layer = 22 > llm_load_print_meta: n_rot = 64 > llm_load_print_meta: n_gqa = 8 > llm_load_print_meta: f_norm_eps = 0.0e+00 > llm_load_print_meta: f_norm_rms_eps = 1.0e-05 > llm_load_print_meta: f_clamp_kqv = 0.0e+00 > llm_load_print_meta: f_max_alibi_bias = 0.0e+00 > llm_load_print_meta: n_ff = 5632 > llm_load_print_meta: n_expert = 0 > llm_load_print_meta: n_expert_used = 0 > llm_load_print_meta: rope scaling = linear > llm_load_print_meta: freq_base_train = 10000.0 > llm_load_print_meta: freq_scale_train = 1 > llm_load_print_meta: n_yarn_orig_ctx = 2048 > llm_load_print_meta: rope_finetuned = unknown > llm_load_print_meta: model type = 1B > llm_load_print_meta: model ftype = Q4_0 > llm_load_print_meta: model params = 1.10 B > llm_load_print_meta: model size = 606.53 MiB (4.63 BPW) > llm_load_print_meta: general.name = TinyLlama > llm_load_print_meta: BOS token = 1 '<s>' > llm_load_print_meta: EOS token = 2 '</s>' > llm_load_print_meta: UNK token = 0 '<unk>' > llm_load_print_meta: PAD token = 2 '</s>' > llm_load_print_meta: LF token = 13 '<0x0A>' > llm_load_tensors: ggml ctx size = 0.08 MiB > llm_load_tensors: using CUDA for GPU acceleration > llm_load_tensors: mem required = 35.23 MiB > llm_load_tensors: offloading 22 repeating layers to GPU > llm_load_tensors: offloading non-repeating layers to GPU > llm_load_tensors: offloaded 23/23 layers to GPU > llm_load_tensors: VRAM used: 571.37 MiB > ....................................................................................... 
> llama_new_context_with_model: n_ctx = 2048 > llama_new_context_with_model: freq_base = 10000.0 > llama_new_context_with_model: freq_scale = 1 > llama_kv_cache_init: VRAM kv self = 44.00 MB > llama_new_context_with_model: KV self size = 44.00 MiB, K (f16): 22.00 MiB, V (f16): 22.00 MiB > llama_build_graph: non-view tensors processed: 466/466 > llama_new_context_with_model: compute buffer total size = 147.19 MiB > llama_new_context_with_model: VRAM scratch buffer: 144.00 MiB > llama_new_context_with_model: total VRAM used: 759.38 MiB (model: 571.37 MiB, context: 188.00 MiB) > 2024/01/08 13:53:37 ext_server_common.go:151: Starting internal llama main loop > [GIN] 2024/01/08 - 13:53:37 | 200 | 7.579335261s | 127.0.0.1 | POST "/api/generate" > 2024/01/08 13:53:55 ext_server_common.go:165: loaded 0 images > > NAME ID SIZE MODIFIED > tinyllama:latest 2644915ede35 637 MB 4 minutes ago > ***@***.***:/# ollama tinyllama:latest > Error: unknown command "tinyllama:latest" for "ollama" > ***@***.***:/# ollama run tinyllama:latest > >>> tell me a joke about futurama > ################################ #####################################################################################################################################################################################################################################################################################################################################################################################################################################################################^Z > [1]+ Stopped ollama run tinyllama:latest > > :~# nvidia-smi > Mon Jan 8 14:59:51 2024 > +---------------------------------------------------------------------------------------+ > | NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 | > |-----------------------------------------+----------------------+----------------------+ > | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | > | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | > | | | MIG M. | > |=========================================+======================+======================| > | 0 NVIDIA GeForce RTX 3060 On | 00000000:0A:00.0 Off | N/A | > | 55% 51C P2 68W / 170W | 857MiB / 12288MiB | 49% Default | > | | | N/A | > +-----------------------------------------+----------------------+----------------------+ > | 1 NVIDIA GeForce RTX 3060 On | 00000000:4A:00.0 Off | N/A | > | 60% 55C P2 52W / 170W | 665MiB / 12288MiB | 20% Default | > | | | N/A | > +-----------------------------------------+----------------------+----------------------+ > > +---------------------------------------------------------------------------------------+ > | Processes: | > | GPU GI CI PID Type Process name GPU Memory | > | ID ID Usage | > |=======================================================================================| > | 0 N/A N/A 120413 C /bin/ollama 850MiB | > | 1 N/A N/A 120413 C /bin/ollama 658MiB | > +---------------------------------------------------------------------------------------+ > > ollama version is 0.1.18 > > — > Reply to this email directly, view it on GitHub > <https://github.com/jmorganca/ollama/issues/969#issuecomment-1881068878>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ABDD3ZLEGEZJK6OTHIFYKELYNP33HAVCNFSM6AAAAAA62ONCIKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBRGA3DQOBXHA> > . > You are receiving this because you commented.Message ID: > ***@***.***> >
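One way to rule out the prompt-format mismatch @phalexo mentions is to print the template the model was packaged with. A minimal sketch, assuming the `tinyllama:latest` tag from the log above and the standard `ollama show` subcommand:

```
# print the Modelfile, including the TEMPLATE directive the model ships with
ollama show --modelfile tinyllama:latest
```

If the TEMPLATE block doesn't match the format the model was trained on, garbled or runaway output is a plausible symptom, though the driver reports below point at a CUDA-level cause in this case.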
Author
Owner

@simplesisu commented on GitHub (Jan 8, 2024):

For which model?

<!-- gh-comment-id:1881191044 --> @simplesisu commented on GitHub (Jan 8, 2024): For which model?
Author
Owner

@IliyanGochev commented on GitHub (Jan 9, 2024):

Had the same problem: Ubuntu 22.04 LTS, 2x RTX 3090. No matter which model I tried (Phi, Mistral), I'd get either infinite gibberish words or infinite # signs.

I tried the suggestion of upgrading to the latest 23.x release of Ubuntu, but that did not help.
It only broke my Nvidia driver (545 / CUDA 12.3 at the time), and Ollama / llama.cpp ran in CPU-only mode.

Then I downgraded the driver to 535 / CUDA 12.2, and now I'm able to run Phi, Mistral, and even Mixtral without a problem on the GPUs.

<!-- gh-comment-id:1883149099 --> @IliyanGochev commented on GitHub (Jan 9, 2024): Had the same problem, Ubuntu 22.04 LTS, 2xRTX3090, no matter the model tired (Phi, Mistral) I'd either get infinite gibberish words or infinite # signs. I've tried the suggestion of upgrading to the latest 23.x of Ubuntu, but that did not help. It only broke my Nvidia driver (545 / CUDA 12.3 at the time) and Ollama / Llama.cpp ran on CPU-only mode. Then I've downgraded the driver to 535 / CUDA 12.2 and now I'm able to run both Phi and Mistral, and even Mixtral without a problem on the GPUs.
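For reference, the downgrade described above can be done through apt on Ubuntu 22.04. A rough sketch, assuming the stock `nvidia-driver-545` and `nvidia-driver-535` package names (exact packages vary by repository):

```
# swap the 545-series driver for the 535 series, then reboot
sudo apt-get remove --purge nvidia-driver-545
sudo apt-get install nvidia-driver-535
sudo reboot

# after reboot, nvidia-smi should report a 535.x driver / CUDA 12.2
nvidia-smi
```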
Author
Owner

@simplesisu commented on GitHub (Jan 12, 2024):

> Had the same problem: Ubuntu 22.04 LTS, 2x RTX 3090. No matter which model I tried (Phi, Mistral), I'd get either infinite gibberish words or infinite # signs.
>
> I tried the suggestion of upgrading to the latest 23.x release of Ubuntu, but that did not help. It only broke my Nvidia driver (545 / CUDA 12.3 at the time), and Ollama / llama.cpp ran in CPU-only mode.
>
> Then I downgraded the driver to 535 / CUDA 12.2, and now I'm able to run Phi, Mistral, and even Mixtral without a problem on the GPUs.

Glad it worked for you! Which version of 535 (v535.146.02, v535.129.03, or other)?

<!-- gh-comment-id:1889530177 --> @simplesisu commented on GitHub (Jan 12, 2024): > Had the same problem, Ubuntu 22.04 LTS, 2xRTX3090, no matter the model tired (Phi, Mistral) I'd either get infinite gibberish words or infinite # signs. > > I've tried the suggestion of upgrading to the latest 23.x of Ubuntu, but that did not help. It only broke my Nvidia driver (545 / CUDA 12.3 at the time) and Ollama / Llama.cpp ran on CPU-only mode. > > Then I've downgraded the driver to 535 / CUDA 12.2 and now I'm able to run both Phi and Mistral, and even Mixtral without a problem on the GPUs. Glad it worked for you! which version of 535..**.( v535.146.02, v535.129.03 or other?)**
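The installed build can be read straight from the driver with standard nvidia-smi query flags:

```
# prints the driver build, e.g. 535.129.03, one line per GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```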
Author
Owner

@igorschlum commented on GitHub (Jan 12, 2024):

Could you try version 0.1.20? It may solve the issue.

<!-- gh-comment-id:1889654619 --> @igorschlum commented on GitHub (Jan 12, 2024): Could you try with version 0.1.20? It could solve the issue
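On Linux, re-running the install script upgrades an existing installation in place. A minimal sketch, assuming the install URL documented at the time:

```
# upgrade ollama in place, then confirm the new version
curl -fsSL https://ollama.ai/install.sh | sh
ollama -v
```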
Author
Owner

@jmorganca commented on GitHub (Apr 17, 2024):

Hi there, this should be fixed now. If not, please let me know.

<!-- gh-comment-id:2060215179 --> @jmorganca commented on GitHub (Apr 17, 2024): Hi there, this should be fixed now. If not please let me know
Reference: github-starred/ollama#472