Fix Byte Order Mark (BOM) handling in markdown display and editor. #3230

Closed
opened 2025-11-02 05:04:51 -06:00 by GiteaMirror · 14 comments

Originally created by @Corey-M on GitHub (Apr 23, 2019).

Description

When the README.md file contains a Unicode Byte Order Mark (BOM), the first line of the file is
formatted incorrectly. This happens for all BOMs that I have tested: UTF8, Unicode and Unicode
Big Endian. The text of the file loads correctly in all cases; the only problem is that the BOM
is treated as file content and incorrectly displayed in both the rendered view and the editor.

This is a particular problem for developers using Visual Studio and other environments which
insert a BOM by default, a setting that can be difficult to change without add-ons to the environment.

The various Unicode Byte Order Markers are metadata and should not be treated as file content. At a minimum the markdown renderer should detect and discard BOMs in the content. The markdown editor should likewise detect and remove BOMs during load and either replace them during save or save without markers.

I don't understand Go so I won't be submitting a pull request. From 10 minutes casting about in the code it looks like ToUTF8WithErr and ToUTF8WithFallback might be a place to
start. It looks like those are the main methods used to massage text file content into UTF8 for
both render and edit.

Screenshots

Screenshots taken from try.gitea.io site mentioned above.

  • Bad first line formatting, treats BOM as unknown non-printable character:
    image
  • Editor showing BOM as unknown/bad character:
    image
Issue template details:

  • Gitea version (or commit ref): <= 1.8.0
  • Operating system: All
  • Database: All
  • Can you reproduce the bug at https://try.gitea.io: Yes, see https://try.gitea.io/emonk/Markdown-BOM-Test

Links:

  • ToUTF8WithErr: https://github.com/go-gitea/gitea/blob/704da08fdc6bae6fdd6bf1b892ebe12afeef5eca/modules/templates/helper.go#L265
  • ToUTF8WithFallback: https://github.com/go-gitea/gitea/blob/704da08fdc6bae6fdd6bf1b892ebe12afeef5eca/modules/templates/helper.go#L289
  • Screenshot (rendered view): https://user-images.githubusercontent.com/5035006/56549402-0434a600-65c6-11e9-8e23-1a7a7af6b170.png
  • Screenshot (editor): https://user-images.githubusercontent.com/5035006/56549429-14e51c00-65c6-11e9-8507-c3d977be2f1a.png

GiteaMirror added the type/bug label 2025-11-02 05:04:51 -06:00

@silverwind commented on GitHub (Apr 23, 2019):

Would be interested in how GitHub solves this (or not). Generally, I think all BOMs should be abolished on save, but I guess some broken software might rely on it being present (cough, Excel), so there might be some value in trying to preserve it (but never rendering it).

@zeripath commented on GitHub (Apr 23, 2019):

It's probably through some use of the chardet or similar.

It's probably reasonable to pass the stuff through this - but the trouble with the chardet library is that it's not nearly perfect.

@silverwind commented on GitHub (Apr 23, 2019):

BOM should be easily detectable via regex \uFEFF, no need for a library.

@zeripath commented on GitHub (Apr 23, 2019):

Ah but I bet that the next problem will be cp1252 related

@lafriks commented on GitHub (Apr 23, 2019):

Please do not use regexp for this, just check first two bytes

@silverwind commented on GitHub (Apr 23, 2019):

BOM is actually three bytes in UTF-8 and two bytes in UTF-16. Still think it's best to regex-test the unicode string instead like ^\uFEFF, it is a single character in unicode-aware regexp.

@lafriks commented on GitHub (Apr 23, 2019):

It does not matter, it will still be faster than running a regexp and converting the byte array to a string.
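
For illustration, the byte-prefix check discussed in this exchange might look like the sketch below. It is not taken from the Gitea code base: `detectBOM` and the variable names are hypothetical, and it only covers the three BOMs mentioned in the report (UTF-8, UTF-16 little endian, UTF-16 big endian). UTF-32 is deliberately ignored, so a UTF-32 LE file would be reported as UTF-16LE here.

```go
package main

import "bytes"

// Byte sequences that encode U+FEFF at the start of a file.
var (
	bomUTF8    = []byte{0xEF, 0xBB, 0xBF} // three bytes in UTF-8
	bomUTF16LE = []byte{0xFF, 0xFE}       // two bytes, little endian
	bomUTF16BE = []byte{0xFE, 0xFF}       // two bytes, big endian
)

// detectBOM is a hypothetical helper. It returns a label for the BOM found
// at the start of data and the BOM length in bytes, or "" and 0 if the data
// does not start with one of the BOMs above.
func detectBOM(data []byte) (string, int) {
	switch {
	case bytes.HasPrefix(data, bomUTF8):
		return "UTF-8", len(bomUTF8)
	case bytes.HasPrefix(data, bomUTF16LE):
		return "UTF-16LE", len(bomUTF16LE)
	case bytes.HasPrefix(data, bomUTF16BE):
		return "UTF-16BE", len(bomUTF16BE)
	}
	return "", 0
}
```

Working on the raw bytes avoids both the regexp and the string conversion, and the returned length tells the caller how many bytes to skip.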

@zeripath commented on GitHub (Apr 23, 2019):

OK so we are passing this content through a chardet.

OK so we need to adjust our "decoder" for utf-8 to simply remove the BOM if present.
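
A minimal sketch of that idea, assuming the content has already been identified as UTF-8. The name `stripUTF8BOM` is hypothetical and is not the function in the Gitea code base:

```go
package main

import "bytes"

// utf8BOM is U+FEFF encoded as UTF-8.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripUTF8BOM is a hypothetical helper. It returns content without a
// leading UTF-8 BOM. Content that does not start with a BOM is returned
// unchanged, and a U+FEFF that appears later in the content is left alone.
func stripUTF8BOM(content []byte) []byte {
	return bytes.TrimPrefix(content, utf8BOM)
}
```

`bytes.TrimPrefix` returns a subslice of the input rather than a copy, so the check adds no allocation for files without a BOM.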

@silverwind commented on GitHub (Apr 23, 2019):

I think the most elegant solution would be:

  1. Strip BOM from rendered markdown and web editor.
  2. When user saves a file in the web editor check if the previous file had a BOM and if so, add it again.

As I said, BOM preservation is important when dealing with certain Microsoft products, such as Excel, which use the BOM as an indicator of whether a file is UTF-8 or ASCII.

@zeripath commented on GitHub (Apr 23, 2019):

@silverwind - OK I've got removal of the BOM on decoding sorted out. In terms of keeping the previous encoding that's a bit more difficult - we don't currently do that at all - all data created on the editor is assumed to be UTF-8 AFAIU (I certainly didn't write any encoding gadgets - doing it properly is a horrible experience.)

@silverwind commented on GitHub (Apr 23, 2019):

We're making the assumption that everything is UTF-8 so adding it back when committing from the UI shouldn't be too hard. Read the old file, check if its first three bytes match (maybe create a shared hasBOM function to use in the template renderer as well) and if they do, add them to the saved content.

Though I won't block on this; stripping the BOM is already an improvement.
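
The restore-on-save step could be sketched as follows, under the same assumption that everything is UTF-8. `hasBOM` is the shared helper name suggested above; `restoreBOM` and its parameters are hypothetical:

```go
package main

import "bytes"

// utf8BOM is U+FEFF encoded as UTF-8.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// hasBOM reports whether content starts with the three-byte UTF-8 BOM.
func hasBOM(content []byte) bool {
	return bytes.HasPrefix(content, utf8BOM)
}

// restoreBOM is a hypothetical helper. It returns the content to commit:
// if the previous version of the file started with a BOM and the edited
// content does not, the BOM is prepended again. Otherwise edited is
// returned as-is.
func restoreBOM(previous, edited []byte) []byte {
	if !hasBOM(previous) || hasBOM(edited) {
		return edited
	}
	out := make([]byte, 0, len(utf8BOM)+len(edited))
	out = append(out, utf8BOM...)
	return append(out, edited...)
}
```

Checking the edited content as well avoids writing a double BOM when the editor already sent one back.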

@zeripath commented on GitHub (Apr 23, 2019):

OK done.

@zeripath commented on GitHub (Apr 23, 2019):

The attached PR will attempt to re-encode to the detected charset and, upon failure, will default to UTF-8 with or without a BOM as per the original charset.
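
As a rough illustration of that behaviour, and not the code from the PR, the sketch below re-encodes editor text using only the standard library. `encodeForSave`, its parameters, and the charset labels are hypothetical. Only UTF-16 is handled explicitly; any other charset name stands in for the "failure" case and falls back to UTF-8.

```go
package main

import (
	"encoding/binary"
	"unicode/utf16"
)

// encodeForSave is a hypothetical helper. It takes UTF-8 text from the
// editor and re-encodes it to charset. If withBOM is true (the original
// file had a BOM), U+FEFF is written first in the target encoding.
// Charsets other than UTF-16LE and UTF-16BE fall back to UTF-8.
func encodeForSave(text, charset string, withBOM bool) []byte {
	if withBOM {
		text = "\uFEFF" + text
	}
	var order binary.ByteOrder
	switch charset {
	case "UTF-16LE":
		order = binary.LittleEndian
	case "UTF-16BE":
		order = binary.BigEndian
	default:
		// Fallback: keep the text as UTF-8, with or without the BOM.
		return []byte(text)
	}
	// Convert to UTF-16 code units, then serialize in the chosen byte order.
	units := utf16.Encode([]rune(text))
	out := make([]byte, 2*len(units))
	for i, u := range units {
		order.PutUint16(out[2*i:], u)
	}
	return out
}
```

Prepending U+FEFF before encoding means the BOM comes out in the right byte order for each target without any special casing.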

@silverwind commented on GitHub (Apr 23, 2019):

Thanks, will likely test this tomorrow.

Reference: github-starred/gitea#3230