mirror of
https://github.com/go-gitea/gitea.git
synced 2026-03-22 06:24:14 -05:00
Wrong display of cyrillic symbols in UTF-8 file #8970
Closed
opened 2025-11-02 08:24:22 -06:00 by GiteaMirror
·
10 comments
type/bug
Reference: github-starred/gitea#8970
Originally created by @sIspravnikov on GitHub (May 18, 2022).
Description
I have a UTF-8 encoded file with a Cyrillic comment.
When I open this file in the Gitea web view, the Cyrillic symbols are displayed incorrectly.
But in the Gitea file editor, the diff page, and other views, these symbols are displayed correctly.
Gitea Version
1.16.7
Can you reproduce the bug on the Gitea demo site?
Yes
Log Gist
No response
Screenshots
View file:
Edit file:
Diff changes in file:
Reproduced it on the demo site as well:
https://try.gitea.io/sIspravnikov/test/src/branch/main/build.gradle.kts
Git Version
2.30.3
Operating System
ubuntu 20.04
How are you running Gitea?
official docker-container
Database
PostgreSQL
@silverwind commented on GitHub (May 19, 2022):
File displays fine in raw view for me, so this is definitely a bug. Also it triggers the "hidden unicode" incorrectly. Maybe something for @zeripath to check out.
@ashimokawa commented on GitHub (May 20, 2022):
This seems to be a bug of the code formatter.
Can reproduce on codeberg.org also:
d94df248a7/build.gradle.kts
but when I delete everything except the function with the cyrillic characters they suddenly look correct.
https://codeberg.org/test/test/src/branch/main/build.gradle.kts
@sIspravnikov commented on GitHub (May 20, 2022):
In issue #14434 there was an idea about the chardet buffer size of 1024 bytes, so I made a branch on the demo site where I placed the Cyrillic comment at the beginning of the file:
0682cf1ba2
and the view page now displays everything correctly:
https://try.gitea.io/sIspravnikov/test/src/branch/test/build.gradle.kts
@ashimokawa did something similar: when everything except the function with the Cyrillic characters was deleted, the Cyrillic symbols were caught in the first 1024-byte buffer.
So I think @lunny's idea about chardet looks correct.
@supermangithu commented on GitHub (May 20, 2022):
Suggestion: allow Chinese and Japanese UTF-8 characters in organization and repo names, or let the admin enable this individually. Thanks!
Kind regards.
@wxiaoguang commented on GitHub (May 20, 2022):
@kaolagithub Unrelated. There are other similar issues in the issue list, please search for them, for example: 18405, 13295 and 9032.
@zeripath commented on GitHub (May 21, 2022):
Hi, I'm just looking at this.
This looks like a double-encoding UTF-8 problem.
The problem is not the escapecontrolreader - there are test cases to ensure that it's doing what it should do on UTF-8 code.
The problem will be earlier than that.
@zeripath commented on GitHub (May 21, 2022):
The issue is that the file is being detected as ISO-8859-1.
In fact if you check the debug logs you will see:
@zeripath commented on GitHub (May 21, 2022):
I bet the reason why the detection is failing is that the 2048th byte is within a UTF-8 character, and therefore a slight change in the file would cause the correct rendering.
Is this a somewhat carefully calculated failing example? That kind of information would have been helpful to provide, because it would have helped us immediately understand where the problem was.
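The boundary effect suspected here can be reproduced in isolation. The following is a minimal sketch, not Gitea's code: sniffLimit and windowIsValid are hypothetical names, and the 2048 figure is taken from this comment. It only shows that a fixed-size prefix of a valid UTF-8 file can itself be invalid UTF-8; it does not run any charset detector.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sniffLimit is a hypothetical detection window size (assumption: 2048 bytes,
// the figure mentioned in the thread).
const sniffLimit = 2048

// windowIsValid builds a valid UTF-8 sample made of asciiLen ASCII bytes
// followed by Cyrillic text (2 bytes per rune), then reports whether the
// first sniffLimit bytes of that sample are still valid UTF-8.
func windowIsValid(asciiLen int) bool {
	data := []byte(strings.Repeat("a", asciiLen) + strings.Repeat("я", 8))
	return utf8.Valid(data[:sniffLimit])
}

func main() {
	fmt.Println(windowIsValid(sniffLimit - 2)) // true: window ends after a whole rune
	fmt.Println(windowIsValid(sniffLimit - 1)) // false: window ends inside a rune
	fmt.Println(windowIsValid(sniffLimit))     // true: window is pure ASCII
}
```

Shifting the Cyrillic text by a single byte moves the window back onto a rune boundary, which would be consistent with a slight change in the file making the problem disappear.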
@wxiaoguang commented on GitHub (May 21, 2022):
This example is the 2048 problem.
The ToUTF8WithFallbackReader only reads the first 2048 bytes to detect the encoding, which may break UTF-8 runes.
Then there is a weighted algorithm to decide (guess) which encoding should be used.
Usually there should be no problem. But, with this sample:
The top confidence is not UTF-8 here.
The confidence of the result seems related to some statistics-based models (just a guess, I haven't looked into it), so a strings.Repeat("a" ...) won't trigger the bug; the repeated "a"'s top confidence result is still UTF-8.
@wxiaoguang commented on GitHub (May 21, 2022):
Here is a designed test case showing how to trigger the bug:
It will always fail: