Sanitizer exception for IMG SRC attribute not being applied #7407

Closed
opened 2025-11-02 07:25:03 -06:00 by GiteaMirror · 6 comments

Originally created by @mjfs on GitHub (May 29, 2021).

  • Gitea version (or commit ref): 1.13.7
  • Git version: 2.31.1
  • Operating system: Linux (Gitea installed from Arch repository)
  • Database (use [x]):
    • [ ] PostgreSQL
    • [ ] MySQL
    • [ ] MSSQL
    • [x] SQLite
  • Can you reproduce the bug at https://try.gitea.io: Not Applicable (custom configuration)
  • Log gist: Not Applicable (not visible in log)

Description

When using an external markup renderer, the sanitizer exception is not applied, and the attribute is consequently removed from the output.

I am using Pandoc to render Office Open XML documents (docx extension). No matter what combination of sanitizer configuration and markup renderer I choose, the data URI value of the src attribute on the img element is always removed from Gitea's final HTML output for any docx file previewed in the browser (i.e. only <img/> remains).

As I understand the Gitea documentation (https://docs.gitea.io/en-us/external-renderers/#appini-file-configuration) and the cheat sheet (https://docs.gitea.io/en-us/config-cheat-sheet/#markup-markup), the configuration below should work:

[markup.sanitizer.docx]
ELEMENT = img
ALLOW_ATTR = src
REGEXP = ^.*$

[markup.docx]
ENABLED = true
FILE_EXTENSIONS = .docx
RENDER_COMMAND = "pandoc --from docx --to html --self-contained"
IS_INPUT_FILE = false

I was not able to find any workaround for this scenario (one that achieves the desired end result) in the documentation, so if some other solution is generally used for this use case (e.g. externalizing document resources), that will also do.

GiteaMirror added the type/bug label 2025-11-02 07:25:03 -06:00

@matthewlootens commented on GitHub (Jun 1, 2021):

I'm having the same issue as described by @mjfs: I cannot get src attributes on img elements through the sanitizer. In my case, I'm rendering Jupyter Notebook files (.ipynb) with nbconvert. Here the src values are base64-encoded data URIs, so I also added the data URI scheme in the app.ini config:

[markup.sanitizer.rule1]
ELEMENT = img
ALLOW_ATTR = src
REGEXP = 

[markdown]
CUSTOM_URL_SCHEMES = data

[markup.jupyter]
ENABLED = true
FILE_EXTENSIONS = .ipynb
RENDER_COMMAND = "/home/user/.venv/bin/jupyter-nbconvert --stdout --to html --template basic "
IS_INPUT_FILE = true
  • Gitea version: 1.14.2

@Eugene-1984 commented on GitHub (Jun 6, 2021):

The following bluemonday issue, https://github.com/microcosm-cc/bluemonday/issues/51#issuecomment-352395433, suggests that the implementation of the policy allowing src must be something like

	p := bluemonday.NewPolicy()
	p.AllowImages()
	p.AllowDataURIImages()

rather than the straightforward https://github.com/go-gitea/gitea/blob/b3ef6a61e5fc3f9d64e3a9f61fb0027cbb48ba73/modules/markup/sanitizer.go#L114

And this issue, https://github.com/go-gitea/gitea/issues/3025, suggests that a valid configuration exists and has a request for an example to be added to the docs. It would be great if the solution (now or after a bugfix) were added as an example to https://docs.gitea.io/en-us/external-renderers/#appini-file-configuration (which currently has only a TeX example).


@KN4CK3R commented on GitHub (Jun 6, 2021):

This works for me:

[markdown]
CUSTOM_URL_SCHEMES = data

[markup.docx]
ENABLED = true
FILE_EXTENSIONS = .docx
RENDER_COMMAND = "pandoc --from docx --to html --self-contained"
IS_INPUT_FILE = false

It is not the src attribute that is blocked but the data URL. Now the images are there but are not rendered for me in Firefox. The standalone pandoc output works, but not when embedded into Gitea. But that may be another problem.


@mjfs commented on GitHub (Jun 6, 2021):

@KN4CK3R: Your proposal does produce a non-empty IMG SRC attribute. Unfortunately, the data URI gets corrupted, probably in the sanitizing phase. This results in an invalid image, since the content cannot be Base64-decoded into a valid JPG (or whatever other format was used as input). It appears that the payload is still treated as a regular URI during processing and is therefore shortened (e.g. multiple slashes get reduced to a single one).
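
The suspected failure mode can be reproduced in isolation. The sketch below is a hypothetical Python illustration, not Gitea's or bluemonday's actual code: `collapse_slashes` and `survives_collapse` are made-up helper names, and the slash-collapsing step is only an assumption based on the behaviour described above. It relies on the fact that `/` is part of the Base64 alphabet, so any URL-style normalisation of repeated slashes changes the payload.

```python
import base64
import binascii
import re


def collapse_slashes(payload: str) -> str:
    """Mimic the suspected normalisation: runs of '/' become a single '/'."""
    return re.sub(r"/{2,}", "/", payload)


def survives_collapse(raw: bytes) -> bool:
    """Return True if the Base64 form of `raw` still decodes to `raw`
    after repeated slashes have been collapsed."""
    encoded = base64.b64encode(raw).decode("ascii")
    mangled = collapse_slashes(encoded)
    try:
        return base64.b64decode(mangled, validate=True) == raw
    except (binascii.Error, ValueError):
        # The mangled payload is no longer valid Base64 at all.
        return False
```

For example, the three bytes `\xff\xff\xff` encode to `////`, which no longer decodes once collapsed, whereas a payload without consecutive slashes is unaffected.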

The instructions below are not directly related to the open issue, but might be helpful to someone else trying to determine how to use Pandoc as a filter, or when testing the setup.

To avoid composing an entire HTML document when we just need the BODY for the preview, you can define an empty template and reference it in the Gitea configuration as well. In addition, to avoid the warning, also set the title metadata:

pandoc --from docx --to html --metadata title=" " --self-contained --template /usr/bin/Blank.html

The HTML file Blank.html in /usr/bin/ (use a more appropriate location) contains just the following content:

$body$

To test it outside Gitea on the command line, you can use the following (with Sample.docx and Sample.html being the input and output):

cat Sample.docx | pandoc --from docx --to html --metadata title=" " --self-contained --template /usr/bin/Blank.html > Sample.html

Instead of the above, one could also cut the redundant lines from the Pandoc output in a wrapper (which is what I used before). The alternative with an empty template was suggested by @jgm as a workaround in a somewhat related Pandoc issue (jgm/pandoc#7331).
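
For anyone who prefers the wrapper route, a minimal sketch of the "cut the redundant lines" step might look like the following. It is written in Python purely for illustration; `extract_body` is a hypothetical helper, and the regular expression is a crude assumption that the renderer emits a single, well-formed `<body>` element.

```python
import re

# Matches the content between <body ...> and </body>, across lines.
_BODY_RE = re.compile(r"<body[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(html: str) -> str:
    """Return only the inner content of <body>; fall back to the input
    unchanged when no <body> element is found."""
    match = _BODY_RE.search(html)
    return match.group(1).strip() if match else html
```

The empty-template approach is probably more robust, since it does not depend on parsing the generated HTML at all.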


@KN4CK3R commented on GitHub (Jun 8, 2021):

fyi #16098 and #16110

The problem with some Jupyter files is invalid data URI images. If the input file contains images in base64 format with lines separated by newlines, they will be dropped by the sanitizer because a data URI should not contain control characters. You may need to convert the Jupyter input or output and strip those newlines.

Sample input with \n in the image data:

"outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEZCAYAAACervI0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGdBJREFUeJzt3Xu0lXWd+PH3B7xfR5vUUrTMrPw1SjlpiubJ8lJqeBlN\nx5Hy15AzK9JxrRydzGAcXWo3tVrmXUETUUsZNRNdejQUEkmDStL6KZQheUnxkqjw+f3xbOLiAfbB\ns/fz7P28X2vtxT777IfzYQPfz..."
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
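
To check whether a given image payload would be affected, something like the following could be used. This is only an illustrative Python sketch of the rule stated above (no control characters in a data URI), not the sanitizer's actual check, and `has_control_chars` is a hypothetical name.

```python
def has_control_chars(value: str) -> bool:
    """Return True if the string contains an ASCII control character
    (code points below 0x20, or DEL at 0x7F), such as a newline."""
    return any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in value)
```

A payload such as the sample above, once decoded from JSON, contains real newline characters and would therefore be flagged.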

You could use a wrapper script that removes the newlines before passing the file to nbconvert.
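
A minimal sketch of such a wrapper's core, in Python, is shown below. It assumes the nbformat 4 layout (`cells` → `outputs` → `data`, keyed by MIME type); `strip_image_whitespace` and `main` are hypothetical names, and how the cleaned notebook is then handed to nbconvert depends on the local setup.

```python
import json
import sys


def strip_image_whitespace(notebook: dict) -> dict:
    """Remove newlines and other whitespace from base64 image payloads
    in cell outputs. Modifies the notebook in place and returns it."""
    for cell in notebook.get("cells", []):
        for output in cell.get("outputs", []):
            data = output.get("data", {})
            for mime, payload in list(data.items()):
                # SVG output is text, not base64, so it is left untouched.
                if not mime.startswith("image/") or mime == "image/svg+xml":
                    continue
                if isinstance(payload, list):
                    # Multi-line values may be stored as a list of strings.
                    payload = "".join(payload)
                data[mime] = "".join(payload.split())
    return notebook


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read a notebook as JSON from stdin, write the cleaned one to stdout."""
    json.dump(strip_image_whitespace(json.load(stdin)), stdout)
```

Calling `main()` at the bottom of a script would turn this into a stdin-to-stdout filter. Parsing the JSON, rather than deleting every `\n` textually, keeps newlines in source cells and text outputs intact.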


@KN4CK3R commented on GitHub (Jun 16, 2021):

A wrapper will no longer be needed after we upgrade bluemonday (see https://github.com/microcosm-cc/bluemonday/pull/123).

Reference: github-starred/gitea#7407