Some common characters should not be treated as ambiguous Unicode characters #9485

Closed
opened 2025-11-02 08:40:21 -06:00 by GiteaMirror · 8 comments

Originally created by @wxiaoguang on GitHub (Aug 30, 2022).

Gitea 1.18-dev

https://try.gitea.io/wxiaoguang/test/src/branch/master/test-chars.md

Some common characters should not be treated as ambiguous Unicode characters.

Many CJK punctuation marks are quite common in daily usage; they should not be marked as `ambiguous character`.

Otherwise, the misleading warning appears on every page that contains CJK text.

![image](https://user-images.githubusercontent.com/2114189/187440457-8c5fc86d-e9b5-4381-ae41-c1303c0f7d17.png)

GiteaMirror added the topic/ui, type/bug labels 2025-11-02 08:40:21 -06:00

@lunny commented on GitHub (Nov 27, 2022):

At least these Chinese characters are normal and should not trigger a warning.


@zeripath commented on GitHub (Nov 28, 2022):

If your locale is zh-CN these will not be shown as ambiguous. Further, if you look carefully, it's only when the characters (e.g. `，` (FULL-WIDTH COMMA)) are next to English words/Latin text that they're shown as ambiguous. That's because the ambiguous detection looks locally at the specific words and not at the whole sentence; it has to do this in order to prevent misbehaviour elsewhere.
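
To illustrate the word-local behaviour described above, here is a minimal Go sketch. It is not Gitea's actual implementation: the `confusables` table, the function names, and the "word = whitespace-separated run" rule are all assumptions made for the example. A confusable character is flagged only when the same word also contains Latin letters, which is why a full-width comma directly next to English text gets marked while the same comma inside pure Chinese text does not.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// confusables is a tiny, hypothetical table mapping a character to the ASCII
// character it resembles. A real table would be far larger.
var confusables = map[rune]rune{
	'\uFF0C': ',', // FULLWIDTH COMMA
	'\uFF1A': ':', // FULLWIDTH COLON
	'\u0430': 'a', // CYRILLIC SMALL LETTER A
}

// ambiguousInWord returns the confusable runes in word, but only if the word
// also contains at least one Latin letter. The decision is purely local.
func ambiguousInWord(word string) []rune {
	hasLatin := false
	var hits []rune
	for _, r := range word {
		if unicode.Is(unicode.Latin, r) {
			hasLatin = true
		}
		if _, ok := confusables[r]; ok {
			hits = append(hits, r)
		}
	}
	if !hasLatin {
		return nil
	}
	return hits
}

// findAmbiguous checks each whitespace-separated word independently and
// collects every flagged rune.
func findAmbiguous(text string) []rune {
	var out []rune
	for _, w := range strings.Fields(text) {
		out = append(out, ambiguousInWord(w)...)
	}
	return out
}

func main() {
	// Pure Chinese text: the full-width comma has no Latin neighbours.
	fmt.Printf("%q\n", findAmbiguous("你好，世界"))
	// The same comma directly after "Gitea" sits in a word with Latin letters.
	fmt.Printf("%q\n", findAmbiguous("使用 Gitea，一个自托管服务"))
}
```

Because Chinese text is usually written without spaces, a "word" here can span an English name plus the punctuation that follows it, which matches the false positives reported in this issue.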


@zeripath commented on GitHub (Nov 28, 2022):

I think what we should do is only show ambiguous warnings on Source Code and not on rendered files - as per VSCode.


@zeripath commented on GitHub (Nov 28, 2022):

> At least these Chinese characters are normal and should not trigger a warning.

Only if you're using them in Chinese text - they're very definitely ambiguous otherwise. The difficulty is always what is "Chinese" or "English" text. The algorithm - which is the same as that used in vscode - does this on a word by word basis. One could try to come up with some heuristic to determine whether a paragraph as a whole is some language or other - but that's almost certainly still an open problem in NLP.
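
To make the difficulty concrete, a naive paragraph-level heuristic could simply count Han and Latin letters and pick the majority. The Go sketch below is only an illustration of that idea, not something proposed in this thread as a fix; the function name `dominantScript` and its return values are hypothetical, and it ignores every script other than Han and Latin.

```go
package main

import (
	"fmt"
	"unicode"
)

// dominantScript returns "Han" or "Latin" depending on which script has more
// letters in the paragraph, or "" when the counts are tied (including when
// neither script occurs at all).
func dominantScript(paragraph string) string {
	var han, latin int
	for _, r := range paragraph {
		switch {
		case unicode.Is(unicode.Han, r):
			han++
		case unicode.Is(unicode.Latin, r):
			latin++
		}
	}
	switch {
	case han > latin:
		return "Han"
	case latin > han:
		return "Latin"
	default:
		return ""
	}
}

func main() {
	fmt.Println(dominantScript("xorm 是一个简单而强大的 Go 语言 ORM 库"))
	fmt.Println(dominantScript("Hello, world"))
}
```

A count like this is easily skewed: a Chinese paragraph that quotes a few long English identifiers can tip over to "Latin", which is one reason a whole-paragraph language decision is hard to get right.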

The best solution, I suspect, is to just drop the warning on rendered pages, default to unescaped, and allow people to escape if they are interested.


@wxiaoguang commented on GitHub (Dec 1, 2022):

> If your locale is zh-CN these will not be shown as ambiguous.

It still shows.

https://gitea.com/xorm/xorm/src/branch/master/README_CN.md?lang=zh-CN

![image](https://user-images.githubusercontent.com/2114189/204949224-0339d776-25ed-45a5-8f3f-451c977218eb.png)


@silverwind commented on GitHub (Dec 1, 2022):

Even Gitea's own README is marked as containing "ambiguous Unicode characters" because of the pronunciation characters. I think it's silly that we even mark these, and I think the set of matched characters should be reduced to the absolute minimum that may actually be malicious, e.g. non-standard whitespace, text reversal and such.
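
As a rough illustration of what such a reduced set could look like, the Go sketch below flags only Unicode bidirectional control characters (the "text reversal" case) and a short list of invisible or non-breaking characters. The exact character list and the function names are assumptions made for this example, not a vetted security list and not Gitea's code.

```go
package main

import "fmt"

// isBidiControl reports whether r is a Unicode bidirectional formatting
// character, which can make text display in a different order than it is stored.
func isBidiControl(r rune) bool {
	switch {
	case r >= 0x202A && r <= 0x202E: // LRE, RLE, PDF, LRO, RLO
		return true
	case r >= 0x2066 && r <= 0x2069: // LRI, RLI, FSI, PDI
		return true
	case r == 0x200E || r == 0x200F || r == 0x061C: // LRM, RLM, ALM
		return true
	}
	return false
}

// isInvisible reports whether r is in a small, hypothetical list of
// invisible or non-standard space characters.
func isInvisible(r rune) bool {
	switch r {
	case 0x00A0, // NO-BREAK SPACE
		0x200B, // ZERO WIDTH SPACE
		0x200C, // ZERO WIDTH NON-JOINER
		0x200D, // ZERO WIDTH JOINER
		0x2060, // WORD JOINER
		0xFEFF: // ZERO WIDTH NO-BREAK SPACE (BOM)
		return true
	}
	return false
}

// findSuspicious returns every rune in text that matches the reduced set.
func findSuspicious(text string) []rune {
	var out []rune
	for _, r := range text {
		if isBidiControl(r) || isInvisible(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	fmt.Printf("%U\n", findSuspicious("Gitea，你好 café"))
	fmt.Printf("%U\n", findSuspicious("admin\u202Etxt.exe"))
}
```

With a set like this, ordinary CJK punctuation and accented Latin letters pass through untouched, while a right-to-left override hidden in a file name is still reported.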

<img width="439" alt="image" src="https://user-images.githubusercontent.com/115237/205051756-c95144cd-6908-43a8-8926-f5117b8b6b64.png">


@zeripath commented on GitHub (Dec 3, 2022):

I think the only answer is to not render the warning on rendered pages.

This is a simple 3-line diff:

```patch
diff --git a/templates/repo/view_file.tmpl b/templates/repo/view_file.tmpl
index 0fe0a1319..9d82cc018 100644
--- a/templates/repo/view_file.tmpl
+++ b/templates/repo/view_file.tmpl
@@ -58,7 +58,9 @@
 		</div>
 	</h4>
 	<div class="ui attached table unstackable segment">
-		{{template "repo/unicode_escape_prompt" dict "EscapeStatus" .EscapeStatus "root" $}}
+		{{if not (or .IsMarkup .IsRenderedHTML)}}
+			{{template "repo/unicode_escape_prompt" dict "EscapeStatus" .EscapeStatus "root" $}}
+		{{end}}
 		<div class="file-view{{if .IsMarkup}} markup {{.MarkupType}}{{else if .IsRenderedHTML}} plain-text{{else if .IsTextSource}} code-view{{end}}">
 			{{if .IsMarkup}}
 				{{if .FileContent}}{{.FileContent | Safe}}{{end}}
```


@zeripath commented on GitHub (Dec 3, 2022):

> > If your locale is zh-CN these will not be shown as ambiguous.
>
> It still shows.

OK, well, that's a bug. Looking at ambiguous_gen, the locale exceptions there are keyed on zh-hant/zh-hans, so we'll need to add a mapping from zh-CN to zh-hans.
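
A minimal sketch of such a mapping in Go, assuming the exception table is keyed by lowercase script-based tags. The function name `tableLocale` is hypothetical, and every entry other than zh-CN to zh-hans is an assumption added for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// tableLocale converts a UI locale such as "zh-CN" into the key assumed to be
// used by the ambiguous-character exception table.
func tableLocale(uiLocale string) string {
	l := strings.ToLower(uiLocale)
	switch l {
	case "zh-cn":
		return "zh-hans" // the mapping named in this comment
	case "zh-tw", "zh-hk":
		return "zh-hant" // assumed counterparts for traditional Chinese
	}
	// Unknown locales are passed through unchanged (lowercased).
	return l
}

func main() {
	fmt.Println(tableLocale("zh-CN"))
	fmt.Println(tableLocale("zh-TW"))
	fmt.Println(tableLocale("en-US"))
}
```

The lookup into the exception table would then use the returned key rather than the raw UI locale.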


Reference: github-starred/gitea#9485