mirror of
https://github.com/go-gitea/gitea.git
synced 2026-05-11 09:40:19 -05:00
Seeing regular files with special characters in it #6205
Open
opened 2025-11-02 06:48:17 -06:00 by GiteaMirror
·
18 comments
Reference: github-starred/gitea#6205
Originally created by @alexanderadam on GitHub (Oct 23, 2020).
Description
I cannot view a file if it contains special characters. Is there any way to force Gitea to display the file anyway?
See this repo for instance.
The file contains this
And it makes sense that Gitea thinks that this could be a binary file because of the special character. But it obviously isn't a binary file, and I would love to see the file anyway instead of having to download it.
Funnily enough, the commit view shows the content anyway. So something is inconsistent here.
Screenshots
@lunny commented on GitHub (Oct 23, 2020):
What character does the filename contain?
@alexanderadam commented on GitHub (Oct 23, 2020):
The file contains this
and the character causing this is "
".
You can also inspect / check out the example repo if it helps.
@silverwind commented on GitHub (Oct 23, 2020):
That's the Unicode replacement character, which should essentially be treated as text, not binary. How does GitHub handle this case?
@alexanderadam commented on GitHub (Oct 23, 2020):
I pasted the exact character. GitHub is only showing the Unicode replacement character (which is IMHO also the best way of handling this). You can simply try it out yourself:
'Preview' tab
@zeripath commented on GitHub (Oct 23, 2020):
https://github.com/zeripath/pathological/blob/be-broken/regular_text_file.rb is how it appears on github.com
@alexanderadam commented on GitHub (Oct 23, 2020):
So it's similar to the commit view of Gitea (when you copy the text, you get the 'correct' character as well).
Thus, if the state of Gitea should be on par with the view of GitHub, the 'regular Gitea view' must be fixed, right?
@mrsdizzie commented on GitHub (Oct 23, 2020):
In this case it is because http.DetectContentType returns application/octet-stream for this example:
bfc553164a/modules/base/tool.go (L357)
https://golang.org/pkg/net/http/#DetectContentType
That's also what my system returns locally when checking out the repo... maybe we need a different test for text then, though this file is telling everybody else "I'm not a text file", so ...
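For readers who want to reproduce the behaviour described above, here is a minimal sketch. It assumes the offending byte is a C0 control character such as 0x1E (the value mentioned later in this thread); the sample strings and the `sniff` helper are made up for illustration and are not taken from the example repo or from Gitea's code.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff is a hypothetical helper that simply wraps http.DetectContentType,
// which inspects at most the first 512 bytes of data.
func sniff(data []byte) string {
	return http.DetectContentType(data)
}

func main() {
	plain := []byte("puts 'hello'\n")
	// Same kind of content, but with a 0x1E (record separator) byte inside.
	withCtrl := []byte("puts 'hel\x1elo'\n")

	fmt.Println(sniff(plain))    // text/plain; charset=utf-8
	fmt.Println(sniff(withCtrl)) // application/octet-stream
}
```

A single control byte in the sniffed prefix is enough to flip the result from text to the generic binary type.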
@silverwind commented on GitHub (Oct 23, 2020):
Sounds like a golang bug if � is detected as binary.
https://mimesniff.spec.whatwg.org/#identifying-a-resource-with-an-unknown-mime-type
@alexanderadam commented on GitHub (Oct 23, 2020):
It's an entirely different issue and yet related:
In the comments, Gitea doesn't render the character like GitHub does (�).
Should it stay this way or be adapted to GitHub's logic (or to put it differently: should I open an issue for that)?
@mrsdizzie commented on GitHub (Oct 23, 2020):
Comments seem OK to me (I left an example that looks fine) and I suspect that is just another copy/paste issue or something.
To focus on this issue: it isn't just golang that detects the MIME type like this; the file command seems to as well. So if there is a bug, it's maybe in the general spec of MIME type signatures.
But I bet GitHub just doesn't use MIME type detection for text files for these reasons.
We could instead maybe do something like this:
Which seems to work in a few simple tests including this example
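The snippet referred to above is not preserved in this mirror. Going only by the reply that follows, it involved utf8.Valid, so what follows is a guess at the general idea rather than the original code. The function name `looksLikeUTF8Text` and the 1024-byte limit `sniffLen` are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sniffLen is a hypothetical cap on how many bytes are inspected.
const sniffLen = 1024

// looksLikeUTF8Text reports whether the inspected prefix of data is valid UTF-8.
func looksLikeUTF8Text(data []byte) bool {
	if len(data) > sniffLen {
		data = data[:sniffLen]
		// The cut may have split a multi-byte rune. Drop up to three trailing
		// bytes that do not decode, so a truncated rune is not counted as
		// invalid. (This also forgives up to three genuinely bad bytes at the
		// cut, which is acceptable for a sketch.)
		for i := 0; i < utf8.UTFMax-1 && len(data) > 0; i++ {
			r, size := utf8.DecodeLastRune(data)
			if r != utf8.RuneError || size != 1 {
				break
			}
			data = data[:len(data)-1]
		}
	}
	return utf8.Valid(data)
}

func main() {
	fmt.Println(looksLikeUTF8Text([]byte("puts 'hel\x1elo'\n"))) // true
	fmt.Println(looksLikeUTF8Text([]byte{0xFF, 0xFE, 0x00}))     // false
}
```

Note that NUL and other control bytes are valid UTF-8, so a check like this accepts them; a real implementation would probably combine it with further tests.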
@silverwind commented on GitHub (Oct 23, 2020):
utf8.Valid sounds like a good idea for binary detection. That linked spec only does it on the first 1445 bytes, and I generally think we should limit that parsing for performance reasons.
@zeripath commented on GitHub (Oct 23, 2020):
Well, 0x1e is a control character, and I think it's worth noting that well-behaved documents should not contain one. I mean, it's invalid in XML.
@alexanderadam commented on GitHub (Oct 23, 2020):
Also, all Go code is invalid XML, and yet Gitea is able to present Go code properly, right?
I'm not sure how we ended up with XML as the reference here. 😉
Other source code obviously can contain different kinds of control characters.
Furthermore, GitHub, which seems to be a reference for many things in Gitea, is handling it properly, too.
I absolutely agree that it's an edge case but I'm really curious how we got to XML conformity here. 🤔
@zeripath commented on GitHub (Oct 24, 2020):
Have you considered that the fact that it is invalid in W3C text formats like XML, HTML, etc. might be the reason why the content type is detected as binary...?
@alexanderadam commented on GitHub (Oct 26, 2020):
So the current check whether a file is a binary file is checking whether the file would be XML?
Again: this would exclude many code files from other languages, since Go or C code isn't a W3C format.
I mean, guessing file types is definitely difficult. A complicated content check could sometimes be worse than just mapping the file extension to a mime type. And even going for the magic bytes can lead to false results.
Or do you want to imply that the browser decides whether the View Raw view is shown instead of the file contents? 🤔
@zeripath commented on GitHub (Oct 26, 2020):
No, I'm explaining that the reason why DetectContentType says the file is binary is that the character is not allowed in web text formats. (Even if browsers don't just barf on them, they're supposed to be escaped in HTML etc.)
How do you propose we detect binary formats?
I ask in all honesty because the technique used in git itself is pretty horrible and fatally flawed - IIRC it simply checks whether the file contains a NUL (0x0) character within the first 1kb. This is why git handles UTF-16 formats badly.
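To illustrate why a NUL-based heuristic trips over UTF-16, here is a minimal sketch. It is not git's code: `hasNUL` is a hypothetical helper, and the window size is left as a parameter (the 1024 passed below only mirrors the figure recalled above).

```go
package main

import (
	"bytes"
	"fmt"
)

// hasNUL reports whether a NUL byte occurs within the first window bytes of data.
func hasNUL(data []byte, window int) bool {
	if len(data) > window {
		data = data[:window]
	}
	return bytes.IndexByte(data, 0) >= 0
}

func main() {
	utf8Text := []byte("hi\n")
	// The same text as UTF-16LE with a byte order mark: every ASCII
	// character is followed by a 0x00 byte.
	utf16Text := []byte{0xFF, 0xFE, 'h', 0x00, 'i', 0x00, '\n', 0x00}

	fmt.Println(hasNUL(utf8Text, 1024))  // false
	fmt.Println(hasNUL(utf16Text, 1024)) // true
}
```

Under this heuristic the UTF-16 file is classified as binary even though it is plain text.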
The next problem we face is the content encoding detection problem. Because people persist in committing documents that are not in UTF-8 but in encodings like CP-1252, we need to detect these to show them. In general, encoding detection libraries do not expect to see control characters other than CR or LF and will assign a very low likelihood to any encoding.
Then once you've got those sorted you'll need to escape the control characters as they are absolutely not supposed to be in html documents - (I don't know whether highlighter is doing the correct thing above) - they should therefore be replaced.
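One possible reading of "replaced" is to substitute each control character with U+FFFD before rendering, similar to what GitHub is reported to show earlier in this thread. The sketch below assumes that tab, LF and CR should be kept; `replaceControls` is a hypothetical name, not an existing Gitea function.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceControls substitutes U+FFFD for C0 control characters and DEL,
// keeping tab, line feed and carriage return untouched.
func replaceControls(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == '\t' || r == '\n' || r == '\r':
			return r
		case r < 0x20 || r == 0x7F:
			return '\uFFFD'
		}
		return r
	}, s)
}

func main() {
	// The 0x1E byte is replaced; the trailing newline is kept.
	fmt.Printf("%q\n", replaceControls("a\x1eb\n"))
}
```

Whether to replace, escape visibly, or strip such characters is a design choice this sketch does not settle.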
Dealing with text encodings is not easy. It is subtle and full of problems. The utf8.Valid solution noted above would fail for non-UTF-8 character encodings.
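A small sketch of that failure mode, using a made-up sample word: "café" encoded as CP-1252 stores é as the single byte 0xE9, which is not valid UTF-8 on its own. The `classify` helper is hypothetical and only stands in for a utf8.Valid-based text check.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// classify is a hypothetical stand-in for a utf8.Valid-based text check.
func classify(data []byte) string {
	if utf8.Valid(data) {
		return "text"
	}
	return "binary"
}

func main() {
	utf8Bytes := []byte("caf\u00e9")           // é encoded as 0xC3 0xA9
	cp1252Bytes := []byte{'c', 'a', 'f', 0xE9} // é encoded as 0xE9

	fmt.Println(classify(utf8Bytes))   // text
	fmt.Println(classify(cp1252Bytes)) // binary
}
```

So a perfectly readable CP-1252 file would be treated as binary unless encoding detection runs first.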
@alexanderadam commented on GitHub (Oct 26, 2020):
Thank you, Zeri, your last comment indeed explains your thoughts pretty well. 👍
@stale[bot] commented on GitHub (Dec 25, 2020):
This issue has been automatically marked as stale because it has not had recent activity. I am here to help clear issues left open even if solved or waiting for more insight. This issue will be closed if no further activity occurs during the next 2 weeks. If the issue is still valid just add a comment to keep it alive. Thank you for your contributions.