Seeing regular files with special characters in them #6205

Open
opened 2025-11-02 06:48:17 -06:00 by GiteaMirror · 18 comments

Originally created by @alexanderadam on GitHub (Oct 23, 2020).

  • Gitea version (or commit ref): 1.14.0

Description

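I cannot view a file if it contains special characters. Is there any way to force the file to be displayed anyway?
See this repo (https://try.gitea.io/alexanderadam/text_file_with_special_char/src/branch/master/regular_text_file.rb) for instance.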

The file contains this

# frozen_string_literal: true

puts 'Foobar!'

It makes sense that Gitea thinks this could be a binary file because of the special character in the string (it does not render here). But it obviously isn't a binary file, and I would love to view the file anyway instead of having to download it.

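Funnily enough, the commit view (https://try.gitea.io/alexanderadam/text_file_with_special_char/commit/b81c44c0aaa02cb9a8df7bf7072a225868757178) shows the content anyway. So something is inconsistent here.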

Screenshots

Screenshot from 2020-10-23 12-55-05

GiteaMirror added the issue/confirmed and issue/stale labels 2025-11-02 06:48:17 -06:00

@lunny commented on GitHub (Oct 23, 2020):

What character does the filename contain?


@alexanderadam commented on GitHub (Oct 23, 2020):

The file contains this

# frozen_string_literal: true

puts 'Foobar!'

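and the character causing this is the one inside the string literal (it does not render here).

You can also inspect or check out the example repo (https://try.gitea.io/alexanderadam/text_file_with_special_char/) if it helps.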


@silverwind commented on GitHub (Oct 23, 2020):

That's the Unicode replacement character, which should essentially be considered text, not binary. How does GitHub handle this case?


@alexanderadam commented on GitHub (Oct 23, 2020):

> That's the unicode replacement character which essentially should be considered as text, not binary.

I pasted the exact character. GitHub is only showing the Unicode replacement character (which is IMHO also the best way of handling this). You can simply try it out yourself:

  1. Go to the Gitea commit view (https://try.gitea.io/alexanderadam/text_file_with_special_char/commit/b81c44c0aaa02cb9a8df7bf7072a225868757178)
  2. Copy the text between the single quotes (')
  3. Paste it into a comment field here on GitHub
  4. See how it is rendered by switching to the Preview tab

GitHub's rendering of unrenderable characters


@zeripath commented on GitHub (Oct 23, 2020):

https://github.com/zeripath/pathological/blob/be-broken/regular_text_file.rb is how it appears on github.com


@alexanderadam commented on GitHub (Oct 23, 2020):

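> https://github.com/zeripath/pathological/blob/be-broken/regular_text_file.rb is how it appears on github.com

So it is similar to the commit view of Gitea (https://try.gitea.io/alexanderadam/text_file_with_special_char/commit/b81c44c0aaa02cb9a8df7bf7072a225868757178): when you copy the text, you get the 'correct' character as well.
Thus, if Gitea should be on par with GitHub's view, the 'regular Gitea view' (https://try.gitea.io/alexanderadam/text_file_with_special_char/src/branch/master/regular_text_file.rb) must be fixed, right?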


@mrsdizzie commented on GitHub (Oct 23, 2020):

In this case it is because http.DetectContentType returns application/octet-stream for this example:

https://github.com/go-gitea/gitea/blob/bfc553164aafaacf5a82427b85145aa2ef34aaa3/modules/base/tool.go#L357

https://golang.org/pkg/net/http/#DetectContentType

That's also what my system returns locally when checking out the repo... maybe we need a different test for text then, though this file is telling everybody else "I'm not a text file", so...
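
For readers who want to reproduce the detection result, here is a minimal sketch. It assumes the offending byte is the 0x1e control character named later in this thread and that it sits inside the string literal; the exact file bytes are not shown here, and `sniff` is a hypothetical helper, not Gitea code.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff is a hypothetical wrapper around the standard library call
// discussed above; it is not part of Gitea.
func sniff(data []byte) string {
	return http.DetectContentType(data)
}

func main() {
	// Assumption: the special character is 0x1e inside the string literal.
	withControl := []byte("# frozen_string_literal: true\n\nputs 'Foo\x1ebar!'\n")
	plain := []byte("# frozen_string_literal: true\n\nputs 'Foobar!'\n")

	fmt.Println(sniff(withControl)) // application/octet-stream
	fmt.Println(sniff(plain))       // text/plain; charset=utf-8
}
```

The only difference between the two inputs is the single control byte, which is enough for the sniffer to fall back to the generic binary type.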


@silverwind commented on GitHub (Oct 23, 2020):

Sounds like a Go bug if � is detected as binary.

https://mimesniff.spec.whatwg.org/#identifying-a-resource-with-an-unknown-mime-type


@alexanderadam commented on GitHub (Oct 23, 2020):

It's an entirely different issue and yet related:

In the comments, Gitea doesn't render the character (https://try.gitea.io/alexanderadam/text_file_with_special_char/issues/1) like GitHub does (�).
Should it stay this way or be adapted to GitHub's logic (or, to put it differently: should I open an issue for that)?


@mrsdizzie commented on GitHub (Oct 23, 2020):

Comments seem OK to me (I left an example that looks fine) and I suspect that is just another copy/paste issue or something.

To focus on this issue: it isn't just Go that detects the MIME type like this; the file command seems to as well. So if there is a bug, it's maybe in the general spec of MIME type signatures.

But I bet GitHub just doesn't use MIME type detection for text files for these reasons.

We could instead maybe do something like this:

// IsTextFile returns true if file content format is plain text or empty.
func IsTextFile(data []byte) bool {
	if len(data) == 0 {
		return true
	}
	return utf8.Valid(data)
}

This seems to work in a few simple tests, including this example.
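
A self-contained version of the proposal above, as a rough sketch for experimenting. The sample inputs are illustrative assumptions: a string containing 0x1e, a lone 0xE9 byte as CP-1252 would encode "é", and the first bytes of a JPEG header.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// IsTextFile reports whether data is empty or consists entirely of
// valid UTF-8. Control characters such as 0x1e are valid UTF-8.
func IsTextFile(data []byte) bool {
	if len(data) == 0 {
		return true
	}
	return utf8.Valid(data)
}

func main() {
	fmt.Println(IsTextFile([]byte("puts 'Foo\x1ebar!'\n")))      // true
	fmt.Println(IsTextFile([]byte("caf\xe9\n")))                 // false
	fmt.Println(IsTextFile([]byte{0xff, 0xd8, 0xff, 0xe0}))      // false
}
```

Note that this check accepts any valid UTF-8, including a NUL byte, and rejects text stored in other encodings, so it trades one class of misdetection for another.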


@silverwind commented on GitHub (Oct 23, 2020):

utf8.Valid sounds like a good idea for binary detection. The linked spec only inspects the first 1445 bytes, and I generally think we should limit how much we parse for performance reasons.
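
One subtlety with limiting the check to a prefix is that a fixed cut can land in the middle of a multi-byte character and make valid text look invalid. The sketch below is one possible way to handle that; `validUTF8Prefix` and `sniffLimit` are hypothetical names, and 1445 is simply the figure quoted above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sniffLimit is a hypothetical cap taken from the figure quoted above.
const sniffLimit = 1445

// validUTF8Prefix validates at most limit bytes of data. When the cut
// would split a multi-byte rune, it moves the cut back to the start of
// that rune so the truncation itself cannot cause a false negative.
func validUTF8Prefix(data []byte, limit int) bool {
	if limit <= 0 || len(data) <= limit {
		return utf8.Valid(data)
	}
	end := limit
	// data[end] is the first byte after the prefix. Step back (at most
	// utf8.UTFMax-1 times) until that byte starts a rune.
	for end > 0 && limit-end < utf8.UTFMax-1 && !utf8.RuneStart(data[end]) {
		end--
	}
	return utf8.Valid(data[:end])
}

func main() {
	// 1000 copies of a two-byte rune: the cut at 1445 splits one of them.
	data := []byte(strings.Repeat("\u00e9", 1000))

	fmt.Println(utf8.Valid(data[:sniffLimit]))     // false (naive cut)
	fmt.Println(validUTF8Prefix(data, sniffLimit)) // true
}
```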


@zeripath commented on GitHub (Oct 23, 2020):

Well, 0x1e is a control character - I think it's worth noting that well-behaved documents should not contain one. I mean, it's invalid in XML.


@alexanderadam commented on GitHub (Oct 23, 2020):

> I mean it's invalid in XML.

Also, all Go code is invalid XML, and yet Gitea is able to present Go code properly, right?
I'm not sure how we ended up with XML as the reference here. 😉
Other source code obviously can contain different kinds of control characters.

Furthermore, GitHub, which seems to be a reference for many things in Gitea, handles it properly, too.

I absolutely agree that it's an edge case, but I'm really curious how we got to XML conformity here. 🤔


@zeripath commented on GitHub (Oct 24, 2020):

Have you considered that the fact that it is invalid in W3C text formats like XML, HTML etc. might be the reason why the content type is detected as binary...?


@alexanderadam commented on GitHub (Oct 26, 2020):

So the current check for whether a file is binary is checking whether the file would be valid XML?
Again: this would exclude many code files from other languages, since Go or C code isn't a W3C format.

I mean, guessing file types is definitely difficult. A complicated content check could sometimes be worse than just mapping the file extension to a MIME type. And even going for the magic bytes can lead to false results.

Or do you want to imply that the browser decides whether the View Raw view is shown instead of the file contents? 🤔


@zeripath commented on GitHub (Oct 26, 2020):

No, I'm explaining that the reason http.DetectContentType says the file is binary is that the character is not allowed in web text formats. (Even if browsers don't just barf on such characters, they're supposed to be escaped in HTML etc.)

How do you propose we detect binary formats?

I ask in all honesty because the technique used in git itself is pretty horrible and fatally flawed - IIRC it simply checks whether the file contains a NUL (0x0) character within the first 1kb. This is why git handles UTF-16 formats badly.
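
To illustrate why a NUL-based heuristic trips over UTF-16, here is a rough sketch. It is not git's actual code; `looksBinaryByNUL` is a hypothetical name, and the 1024-byte window only mirrors the "first 1kb" recollection above, which may not match what git really inspects.

```go
package main

import (
	"bytes"
	"fmt"
)

// looksBinaryByNUL reports whether a NUL byte occurs within the first
// window bytes of data. The window size is an assumption of this sketch.
func looksBinaryByNUL(data []byte, window int) bool {
	if window > 0 && len(data) > window {
		data = data[:window]
	}
	return bytes.IndexByte(data, 0) >= 0
}

func main() {
	// "hi" in UTF-16LE: every ASCII character carries a zero high byte.
	utf16le := []byte{'h', 0x00, 'i', 0x00}
	// Assumption: the file from this issue contains 0x1e but no NUL.
	withControl := []byte("puts 'Foo\x1ebar!'\n")

	fmt.Println(looksBinaryByNUL(utf16le, 1024))     // true
	fmt.Println(looksBinaryByNUL(withControl, 1024)) // false
}
```

Under this heuristic the file from this issue would count as text, while perfectly readable UTF-16 text would count as binary.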

The next problem we face is detecting the content encoding. Because people persist in committing documents that are not UTF-8 but in encodings like CP-1252, we need to detect these in order to show them. In general, encoding detection libraries do not expect to see control characters other than CR or LF and will assign a very low likelihood to any encoding.

Then, once you've got those sorted, you'll need to escape the control characters, as they are absolutely not supposed to be in HTML documents (I don't know whether the highlighter is doing the correct thing above); they should therefore be replaced.
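
The replacement step described above could look roughly like the following. This is only a sketch: `replaceControlChars` is a hypothetical helper, and the choices to keep tab, LF and CR and to substitute U+FFFD are assumptions, not what Gitea or its highlighter does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// replaceControlChars substitutes U+FFFD for every control character
// except tab, line feed and carriage return.
func replaceControlChars(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\t', '\n', '\r':
			return r // ordinary whitespace stays as-is
		}
		if unicode.IsControl(r) {
			return '\uFFFD'
		}
		return r
	}, s)
}

func main() {
	// Assumption: the special character is 0x1e inside the string literal.
	fmt.Print(replaceControlChars("puts 'Foo\x1ebar!'\n"))
}
```

The output would still need normal HTML escaping before being placed in a page; this sketch only covers the control characters.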

Dealing with text encodings is not easy. It is subtle and full of problems. The utf8.Valid solution noted above would fail for non-UTF-8 character encodings.


@alexanderadam commented on GitHub (Oct 26, 2020):

Thank you Zeri, your last comment indeed explains your thoughts pretty well. 👍


@stale[bot] commented on GitHub (Dec 25, 2020):

This issue has been automatically marked as stale because it has not had recent activity. I am here to help clear issues left open even if solved or waiting for more insight. This issue will be closed if no further activity occurs during the next 2 weeks. If the issue is still valid just add a comment to keep it alive. Thank you for your contributions.

Reference: github-starred/gitea#6205