Wrong display of cyrillic symbols in UTF-8 file #8970

Closed
opened 2025-11-02 08:24:22 -06:00 by GiteaMirror · 10 comments

Originally created by @sIspravnikov on GitHub (May 18, 2022).

Description

I have a UTF-8 encoded file with a Cyrillic comment.
When I open this file in the Gitea web view, the Cyrillic symbols are displayed incorrectly.
But in the Gitea file editor, on the diff page, and elsewhere, these symbols are displayed correctly.

Gitea Version

1.16.7

Can you reproduce the bug on the Gitea demo site?

Yes

Log Gist

No response

Screenshots

View file: [screenshot 1]

Edit file: [screenshot 2]

Diff changes in file: [screenshot 3]

I also reproduced it on the demo site:
https://try.gitea.io/sIspravnikov/test/src/branch/main/build.gradle.kts

Git Version

2.30.3

Operating System

ubuntu 20.04

How are you running Gitea?

official docker-container

Database

PostgreSQL

GiteaMirror added the type/bug label 2025-11-02 08:24:22 -06:00

@silverwind commented on GitHub (May 19, 2022):

File displays fine in raw view for me, so this is definitely a bug. Also it triggers the "hidden unicode" incorrectly. Maybe something for @zeripath to check out.


@ashimokawa commented on GitHub (May 20, 2022):

This seems to be a bug of the code formatter.

Can reproduce on codeberg.org also:

https://codeberg.org/test/test/src/commit/d94df248a7937afc16bb6d3c98cf17a8b7862ff0/build.gradle.kts

but when I delete everything except the function with the cyrillic characters they suddenly look correct.

https://codeberg.org/test/test/src/branch/main/build.gradle.kts


@sIspravnikov commented on GitHub (May 20, 2022):

Issue #14434 raised an idea about the chardet buffer size of 1024 bytes, so I made a branch on the demo site where I placed the Cyrillic comment at the beginning of the file:
https://try.gitea.io/sIspravnikov/test/commit/0682cf1ba2862f1a39e734ecb6fa6a7bcdaa57fb
and the view page now displays everything correctly:
https://try.gitea.io/sIspravnikov/test/src/branch/test/build.gradle.kts
@ashimokawa did something similar: once everything except the function with the Cyrillic characters was deleted, the Cyrillic symbols fell within the first 1024-byte buffer.

So I think @lunny's idea about chardet looks correct.


@supermangithu commented on GitHub (May 20, 2022):

I suggest allowing Chinese and Japanese UTF-8 characters in organization and repository names, or letting admins enable this individually. Thanks!
Kind regards.


@wxiaoguang commented on GitHub (May 20, 2022):

@kaolagithub That is unrelated. There are other similar issues in the issue list, please search for them, for example: 18405, 13295 and 9032.


@zeripath commented on GitHub (May 21, 2022):

Hi I'm just looking at this.

This looks like a double encoding utf8 problem.

The problem is not the escapecontrolreader - there are test cases to ensure that it's doing what it should do on utf8 code.

The problem will be earlier than that.


@zeripath commented on GitHub (May 21, 2022):

The issue is that the file is being detected as ISO-8859-1.

In fact if you check the debug logs you will see:

2022/05/21 09:36:27 ...s/charset/charset.go:190:DetectEncoding() [D] [6288a48b] Detected encoding: ISO-8859-1
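
To illustrate why a wrong ISO-8859-1 detection garbles Cyrillic text, here is a minimal sketch that uses only the Go standard library. The helper `latin1ToUTF8` is a hypothetical name for this example and is not Gitea's conversion code. It treats each byte as one ISO-8859-1 character and re-encodes it as UTF-8, which is the double encoding described above.

```go
package main

import (
	"fmt"
	"strings"
)

// latin1ToUTF8 (hypothetical helper) interprets every byte as one
// ISO-8859-1 character and re-encodes it as UTF-8. In ISO-8859-1 the
// byte value is equal to the Unicode code point.
func latin1ToUTF8(b []byte) string {
	var sb strings.Builder
	for _, c := range b {
		sb.WriteRune(rune(c))
	}
	return sb.String()
}

func main() {
	original := "Вы" // two Cyrillic letters, bytes D0 92 D1 8B in UTF-8
	garbled := latin1ToUTF8([]byte(original))

	// Each original byte becomes its own two-byte UTF-8 character.
	fmt.Printf("%q -> %q\n", original, garbled)
	fmt.Println(len(original), len(garbled)) // 4 8
}
```

Two of the four resulting characters (U+0092 and U+008B) are C1 control characters. That may be why the "hidden unicode" warning was triggered as well, although this is a guess and not confirmed in the thread.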

@zeripath commented on GitHub (May 21, 2022):

I bet the reason the detection is failing is that the 2048th byte falls within a UTF-8 character, so a slight change to the file would make it render correctly.

Is this a somewhat carefully calculated failing example? That kind of information would have been helpful to provide, because it would have let us understand immediately where the problem was.


@wxiaoguang commented on GitHub (May 21, 2022):

This example is the 2048 problem.

func renderFile(ctx *context.Context, entry *git.TreeEntry, treeLink, rawLink string) {
...
	buf := make([]byte, 1024)
	n, _ := util.ReadAtMost(dataRc, buf)
	buf = buf[:n]
...
		rd := charset.ToUTF8WithFallbackReader(io.MultiReader(bytes.NewReader(buf), dataRc))

The ToUTF8WithFallbackReader only reads the first 2048 bytes to detect the encoding, which may split a UTF-8 rune at the end of the sample.

image

Then there is a weighted algorithm to decide (guess) which encoding should be used.

Usually there should be no problem. But, with this sample:

image

The top confidence is not UTF-8 here.

The confidence of the result seems to be related to some statistics-based models (just a guess, I haven't looked into it), so a strings.Repeat("a" ...) won't trigger the bug: for the repeated "a", the top-confidence result is still UTF-8.
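
One possible mitigation for the broken-rune part of the problem is to drop an incomplete trailing sequence from the sample before it is handed to the detector. The sketch below is only an illustration under that assumption. `trimPartialRune` is a hypothetical helper, not the fix that Gitea applied, and the thread does not show whether the detector would then rank UTF-8 highest.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trimPartialRune (hypothetical helper) removes an incomplete UTF-8
// sequence from the end of a detection sample. Samples that end on a
// complete rune, or that are not UTF-8 at all, are returned unchanged.
func trimPartialRune(sample []byte) []byte {
	// A UTF-8 sequence is at most utf8.UTFMax bytes long, so the start of
	// the last rune must be within that many bytes of the end.
	for i := 1; i <= utf8.UTFMax && i <= len(sample); i++ {
		start := len(sample) - i
		if utf8.RuneStart(sample[start]) {
			if utf8.FullRune(sample[start:]) {
				return sample // last rune is complete
			}
			return sample[:start] // drop the partial rune
		}
	}
	return sample
}

func main() {
	// "ab" followed by only the first byte of a two-byte Cyrillic letter.
	sample := []byte("abВы")[:3]
	trimmed := trimPartialRune(sample)

	fmt.Println(utf8.Valid(sample), utf8.Valid(trimmed)) // false true
	fmt.Printf("%q\n", trimmed)                          // "ab"
}
```

Trimming at most three bytes keeps the sample size essentially unchanged, so it only removes the artificial invalid sequence created by the sample boundary.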


@wxiaoguang commented on GitHub (May 21, 2022):

Here is a test case designed to trigger the bug:

func TestDetectEncodingEx(t *testing.T) {
	for testLen := 0; testLen < 2048; testLen++ {
		pattern := "    test { () }\n"
		input := ""
		for len(input) < testLen {
			input += pattern
		}
		input = input[:testLen]
		input += "// Выключаем"
		rd := ToUTF8WithFallbackReader(bytes.NewReader([]byte(input)))
		r, _ := io.ReadAll(rd)
		assert.EqualValuesf(t, input, string(r), "testing string len=%d", testLen)
	}
}

It will always fail:

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -127,2 +127,2 @@
        	            	     test { () }
        	            	-    te// Выключаем
        	            	+    te// Выключаем
        	Test:       	TestDetectEncodingEx
        	Messages:   	testing string len=2038

Reference: github-starred/gitea#8970