[GH-ISSUE #5005] ollama create -f Modelfile doesn't process UTF-8 encoding correctly #3166

Closed
opened 2026-04-12 13:39:20 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @MGdesigner on GitHub (Jun 12, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5005

Originally assigned to: @mxyng on GitHub.

What is the issue?

Today I upgraded Ollama to version 0.1.43 from the official site. After creating a new model, I found that my system prompt (written with CJK characters) in the Modelfile didn't work. I checked it with:

ollama show mymodel:latest --modelfile

and found that the model's Modelfile is not encoded correctly. I also checked with my old Modelfile; the situation is the same. Only models created by 0.1.42 or earlier work correctly. Please fix it.
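For reference, a minimal Modelfile along the lines of the report might look like this (the base model and prompt text here are hypothetical, not taken from the reporter's setup):

```
FROM llama2
SYSTEM """你是一個有用的助手。"""
```

Creating a model from it with `ollama create mymodel -f Modelfile` and inspecting it with `ollama show mymodel:latest --modelfile` is the workflow where the SYSTEM text shows up garbled on 0.1.43.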

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.1.43

GiteaMirror added the bug label 2026-04-12 13:39:20 -05:00

@MGdesigner commented on GitHub (Jun 12, 2024):

Here's the screenshot:

![modelfile_encodingfailed](https://github.com/ollama/ollama/assets/4480740/d1db5ed3-67ba-4775-abba-c1728de09d8b)


@cdfmlr commented on GitHub (Jun 13, 2024):

I've hit the same problem.

I suspect some changes in 66ab48772f are breaking CJK characters into incomplete parts, which could cause the problem:

https://github.com/ollama/ollama/blob/66ab48772f4f41f3f27fb93e15ef0cf756bda3d0/parser/parser.go#L92-L98

Note that both UTF-8 and UTF-16 are variable-length encodings, meaning a single character can be represented by multiple bytes. If `scanner.Split(sc.ScanBytes)` splits the input into individual bytes, it can indeed break a multi-byte character into incomplete parts. `DecodeRune` is then called on these potentially incomplete sequences, which could result in incorrect decoding and the garbled text shown in @MGdesigner's screenshot.

Disclaimer: I am not good at encoding stuff and am not sure I understand the code properly. But ChatGPT confirmed my point, and it further suggests:

> To avoid this, the program should ensure that it never splits a multi-byte character into separate parts. This could be achieved by using a scanner function that understands the variable-length nature of the encoding and always returns complete characters. For example, in the case of UTF-8, Go’s bufio package provides the ScanRunes function, which is a split function for Scanner that returns each UTF-8-encoded rune as a token. Similar functionality would need to be implemented for UTF-16.


@MGdesigner commented on GitHub (Jun 13, 2024):

> To avoid this, the program should ensure that it never splits a multi-byte character into separate parts. This could be achieved by using a scanner function that understands the variable-length nature of the encoding and always returns complete characters. For example, in the case of UTF-8, Go’s bufio package provides the ScanRunes function, which is a split function for Scanner that returns each UTF-8-encoded rune as a token. Similar functionality would need to be implemented for UTF-16.

I agree with you. I've uploaded a new screenshot showing correct encoding. This model was created by an old Ollama version; we can compare the license part with the last screenshot.

![modelfile_encodingfailed2(correct)](https://github.com/ollama/ollama/assets/4480740/4741b0d3-0765-441d-bdec-cf91c389ad28)


@cdfmlr commented on GitHub (Jun 13, 2024):

Yes, I am working on a fix. Fixing proper UTF-8 & CJK support is simple, but I'm currently failing on a test case of UTF-16-encoded emoji. It seems there are other UTF-16 bugs. 🤦


@cdfmlr commented on GitHub (Jun 13, 2024):

While testing the CJK support, I tested emoji support as well, but I couldn't make test cases containing emoji work with UTF-16. Here is my test case:

```go
func TestParseFileCJKParseFile(t *testing.T) {
	data := `FROM bob
PARAMETER param1 1
PARAMETER param2 4096
SYSTEM """
👋 你好!You are a utf8 file with CJK characters.
"""
`
	// TODO: If adding emoji to the SYSTEM, it will fail to parse.
	var simulateUTF16File = func(endian binary.ByteOrder) []byte {
		utf16File := utf16.Encode(append([]rune{'\ufffe'}, []rune(data)...))
		buf := new(bytes.Buffer)
		err := binary.Write(buf, endian, utf16File)
		require.NoError(t, err)
		return buf.Bytes()
	}

	cases := []struct {
		name        string
		encodedFile []byte
	}{
		{"utf16le", simulateUTF16File(binary.LittleEndian)},
		{"utf16be", simulateUTF16File(binary.BigEndian)},
		{"utf8", []byte(data)},
	}

	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			actual, err := ParseFile(bytes.NewReader(c.encodedFile))
			require.NoError(t, err)

			expected := []Command{
				{Name: "model", Args: "bob"},
				{Name: "param1", Args: "1"},
				{Name: "param2", Args: "4096"},
				{Name: "system", Args: "\n👋 你好!You are a utf8 file with CJK characters.\n"},
			}

			assert.Equal(t, expected, actual.Commands)
		})
	}
}
```

The UTF-8 case passes, but both the UTF-16 BE and LE cases fail with garbled text:

```
    parser_test.go:586: 
        	Error Trace:	/src/ollama/parser/parser_test.go:586
        	Error:      	Not equal: 
        	            	expected: []parser.Command{parser.Command{Name:"model", Args:"bob"}, parser.Command{Name:"param1", Args:"1"}, parser.Command{Name:"param2", Args:"4096"}, parser.Command{Name:"system", Args:"\n👋 你好!You are a utf8 file with CJK characters.\n"}}
        	            	actual  : []parser.Command{parser.Command{Name:"model", Args:"bob"}, parser.Command{Name:"param1", Args:"1"}, parser.Command{Name:"param2", Args:"4096"}, parser.Command{Name:"system", Args:"\n�� 你好!You are a utf8 file with CJK characters.\n"}}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -15,3 +15,3 @@
        	            	   Name: (string) (len=6) "system",
        	            	-  Args: (string) (len=56) "\n👋 你好!You are a utf8 file with CJK characters.\n"
        	            	+  Args: (string) (len=58) "\n�� 你好!You are a utf8 file with CJK characters.\n"
        	            	  }
        	Test:       	TestParseFileCJKParseFile/utf16be
    --- FAIL: TestParseFileCJKParseFile/utf16be (0.00s)

Expected :[]parser.Command{parser.Command{Name:"model", Args:"bob"}, parser.Command{Name:"param1", Args:"1"}, parser.Command{Name:"param2", Args:"4096"}, parser.Command{Name:"system", Args:"\n👋 你好!You are a utf8 file with CJK characters.\n"}}
Actual   :[]parser.Command{parser.Command{Name:"model", Args:"bob"}, parser.Command{Name:"param1", Args:"1"}, parser.Command{Name:"param2", Args:"4096"}, parser.Command{Name:"system", Args:"\n�� 你好!You are a utf8 file with CJK characters.\n"}}
```

The `simulateUTF16File` is copied from the current [TestParseFileUTF16ParseFile](https://github.com/ollama/ollama/blob/916bc46c42e7db86d48a13253eb3f47c134fe9df/parser/parser_test.go#L514) by @pdevine in https://github.com/ollama/ollama/commit/ccdf0b2a449d812a3708a3083f6a725289f4f750.

@MGdesigner @007gzs @pdevine Any idea about this?


@jmorganca commented on GitHub (Jun 13, 2024):

Hi all, sorry about this - working on a fix


@MGdesigner commented on GitHub (Jun 14, 2024):

> While testing the CJK support, I tested the support for emoji as well, but I couldn't make test cases containing emoji work with utf-16. Here is my test case:
>
> @MGdesigner @007gzs @pdevine Any idea about this?

Today Ollama released a new version, 0.1.44. I have tested a UTF-8 file; the wrong UTF-8 encoding is fixed in 0.1.44. Can you test the UTF-16 part again with Ollama 0.1.44? If it works, I'll close the issue.


@cdfmlr commented on GitHub (Jun 14, 2024):

> Today Ollama released a new version, 0.1.44. I have tested a UTF-8 file; the wrong UTF-8 encoding is fixed in 0.1.44. Can you test the UTF-16 part again with Ollama 0.1.44? If it works, I'll close the issue.

I don't have a requirement for UTF-16; I am using UTF-8 as well. And it looks like both UTF-8 and UTF-16 are well tested since https://github.com/ollama/ollama/commit/cd234ce22c85bf34dc50b05c93c4dab513ae8f99. I am OK to close this.

Reference: github-starred/ollama#3166