mirror of
https://github.com/go-gitea/gitea.git
synced 2026-03-12 10:39:38 -05:00
Fix Byte Order Mark (BOM) handling in markdown display and editor. #3230
Closed
opened 2025-11-02 05:04:51 -06:00 by GiteaMirror · 14 comments
type/bug
Reference: github-starred/gitea#3230
Originally created by @Corey-M on GitHub (Apr 23, 2019).
[x]): All.
Description
When the README.md file contains a Unicode Byte Order Mark, the first line of the file is
formatted incorrectly. This happens for all BOMs that I have tested: UTF8, Unicode and Unicode
Big Endian. The text of the file loads correctly in all cases; the only problem is that the BOM
is treated as file content and incorrectly displayed in both the rendered view and the editor.
This is a particular problem for developers using Visual Studio and other environments which
insert a BOM by default, a behavior that can be difficult to change without add-ons to the environment.
The various Unicode Byte Order Markers are metadata and should not be treated as file content. At a minimum the markdown renderer should detect and discard BOMs in the content. The markdown editor should likewise detect and remove BOMs during load and either replace them during save or save without markers.
I don't understand Go so I won't be submitting a pull request. From 10 minutes casting about in the code it looks like
ToUTF8WithErr and ToUTF8WithFallback might be a place to start. It looks like those are the main methods used to massage text file content into UTF8 for
both render and edit.
Screenshots
Screenshots taken from try.gitea.io site mentioned above.
@silverwind commented on GitHub (Apr 23, 2019):
Would be interested in how GitHub solves this (or not). Generally, I think all BOMs should be abolished on save, but I guess some broken software might rely on it being present (cough Excel), so there might be some value in trying to preserve it (but never rendering it).
@zeripath commented on GitHub (Apr 23, 2019):
It's probably through some use of the chardet or similar.
It's probably reasonable to pass the stuff through this - but the trouble with the chardet library is that it's not nearly perfect.
@silverwind commented on GitHub (Apr 23, 2019):
BOM should be easily detectable via the regex \uFEFF, no need for a library.
@zeripath commented on GitHub (Apr 23, 2019):
Ah but I bet that the next problem will be cp1252 related
@lafriks commented on GitHub (Apr 23, 2019):
Please do not use regexp for this, just check first two bytes
@silverwind commented on GitHub (Apr 23, 2019):
BOM is actually three bytes in UTF-8 and two bytes in UTF-16. Still think it's best to regex-test the unicode string instead, like ^\uFEFF; it is a single character in unicode-aware regexp.
@lafriks commented on GitHub (Apr 23, 2019):
It does not matter, it will still be faster than running a regexp and converting the byte array to a string
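For illustration only, the byte-prefix check being argued for here could look roughly like the sketch below. The helper name detectBOM and the variable names are hypothetical and not taken from the Gitea code base; the byte sequences are the standard UTF-8 and UTF-16 BOMs. UTF-32 is deliberately left out, so a UTF-32LE file would be reported as UTF-16LE by this sketch.

```go
package main

import (
	"bytes"
	"fmt"
)

// Standard BOM byte sequences for the encodings mentioned in this thread.
var (
	bomUTF8    = []byte{0xEF, 0xBB, 0xBF}
	bomUTF16LE = []byte{0xFF, 0xFE}
	bomUTF16BE = []byte{0xFE, 0xFF}
)

// detectBOM is a hypothetical helper: it reports which BOM (if any) starts
// the content and how many bytes that BOM occupies. It works directly on
// the raw bytes, so no regexp and no []byte-to-string conversion is needed.
func detectBOM(content []byte) (name string, length int) {
	switch {
	case bytes.HasPrefix(content, bomUTF8):
		return "UTF-8", len(bomUTF8)
	case bytes.HasPrefix(content, bomUTF16LE):
		return "UTF-16LE", len(bomUTF16LE)
	case bytes.HasPrefix(content, bomUTF16BE):
		return "UTF-16BE", len(bomUTF16BE)
	}
	return "", 0
}

func main() {
	name, n := detectBOM([]byte("\xEF\xBB\xBF# README"))
	fmt.Println(name, n)
}
```

The length is returned alongside the name so a caller can slice the BOM off without repeating the comparison.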
@zeripath commented on GitHub (Apr 23, 2019):
OK so we are passing this content through a chardet.
OK so we need to adjust our "decoder" for utf-8 to simply remove the BOM if present.
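A minimal sketch of what "simply remove the BOM if present" might look like for content that is already UTF-8. The function name removeUTF8BOM is an assumption made for this example; it is not claimed to be what the actual decoder in Gitea does.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is U+FEFF encoded as UTF-8.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// removeUTF8BOM is a hypothetical helper that returns content without a
// leading UTF-8 BOM. Content that has no BOM is returned unchanged, and a
// U+FEFF that appears later in the content is left alone.
func removeUTF8BOM(content []byte) []byte {
	return bytes.TrimPrefix(content, utf8BOM)
}

func main() {
	raw := []byte("\xEF\xBB\xBF# Title\n")
	fmt.Printf("%q\n", removeUTF8BOM(raw))
}
```

bytes.TrimPrefix returns a subslice of the input rather than a copy, so the caller should not assume the result is independent of the original buffer.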
@silverwind commented on GitHub (Apr 23, 2019):
I think the most elegant solution would be:
As I said, BOM preservation is important when dealing with certain Microsoft products which use the BOM as an indicator of whether a file is UTF-8 or ASCII, like Excel does.
@zeripath commented on GitHub (Apr 23, 2019):
@silverwind - OK I've got removal of the BOM on decoding sorted out. In terms of keeping the previous encoding that's a bit more difficult - we don't currently do that at all - all data created on the editor is assumed to be UTF-8 AFAIU (I certainly didn't write any encoding gadgets - doing it properly is a horrible experience.)
@silverwind commented on GitHub (Apr 23, 2019):
We're making the assumption that everything is UTF-8 so adding it back when committing from the UI shouldn't be too hard. Read the old file, check if its first three bytes match (maybe create a shared hasBOM function to use in the template renderer as well) and if they do, add them to the saved content.
Though I won't block on this, stripping BOM is already an improvement.
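The idea described in this comment (read the old file, check its first three bytes, add them back to the saved content) could be sketched as follows, under the same everything-is-UTF-8 assumption. Both hasBOM and withOriginalBOM are hypothetical names used only for this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// bom is the three-byte UTF-8 encoding of U+FEFF.
var bom = []byte{0xEF, 0xBB, 0xBF}

// hasBOM reports whether content starts with a UTF-8 BOM. A shared helper
// like this could serve both the renderer and the save path.
func hasBOM(content []byte) bool {
	return bytes.HasPrefix(content, bom)
}

// withOriginalBOM returns the edited content, prefixed with a BOM when the
// old file had one and the edited content does not already carry one.
func withOriginalBOM(old, edited []byte) []byte {
	if !hasBOM(old) || hasBOM(edited) {
		return edited
	}
	out := make([]byte, 0, len(bom)+len(edited))
	out = append(out, bom...)
	return append(out, edited...)
}

func main() {
	old := []byte("\xEF\xBB\xBFold text")
	saved := withOriginalBOM(old, []byte("new text"))
	fmt.Println(hasBOM(saved))
}
```

The result is built in a fresh slice so that neither the shared bom variable nor the edited buffer is modified by the append.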
@zeripath commented on GitHub (Apr 23, 2019):
OK done.
@zeripath commented on GitHub (Apr 23, 2019):
The attached PR will attempt to re-encode to the detected charset and, upon failure, will default to UTF-8 with or without a BOM, as per the original charset.
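The re-encode-then-fall-back behavior described here can be illustrated with a standard-library-only sketch. This is not the code from the PR: the names reencode and encodeForSave are hypothetical, the charset labels are assumed strings, and only UTF-16 is supported as a re-encoding target, so every other charset takes the UTF-8 fallback path.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf16"
)

var errUnsupported = errors.New("unsupported charset")

// encodeUTF16 encodes text as UTF-16 in the given byte order, with a
// leading BOM (U+FEFF) so the byte order stays detectable.
func encodeUTF16(text string, order binary.ByteOrder) []byte {
	units := utf16.Encode([]rune("\uFEFF" + text))
	out := make([]byte, 2*len(units))
	for i, u := range units {
		order.PutUint16(out[2*i:], u)
	}
	return out
}

// reencode is a hypothetical stand-in for a real charset encoder. It only
// knows the two UTF-16 variants and fails for anything else.
func reencode(text, charset string) ([]byte, error) {
	switch charset {
	case "UTF-16LE":
		return encodeUTF16(text, binary.LittleEndian), nil
	case "UTF-16BE":
		return encodeUTF16(text, binary.BigEndian), nil
	}
	return nil, errUnsupported
}

// encodeForSave tries the detected charset first. On failure it falls back
// to UTF-8, adding a BOM only if the original file had one.
func encodeForSave(text, charset string, hadBOM bool) []byte {
	if out, err := reencode(text, charset); err == nil {
		return out
	}
	if hadBOM {
		return append([]byte{0xEF, 0xBB, 0xBF}, text...)
	}
	return []byte(text)
}

func main() {
	fmt.Printf("% X\n", encodeForSave("A", "UTF-16LE", true))
	fmt.Printf("% X\n", encodeForSave("A", "windows-1252", true))
}
```

A real implementation would presumably delegate to a charset library rather than hard-coding the supported encodings; the point of the sketch is only the order of attempts and the BOM decision in the fallback.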
@silverwind commented on GitHub (Apr 23, 2019):
Thanks, will likely test this tomorrow.