Repository indexer clogs with file with multi-byte character sets #3758

Closed
opened 2025-11-02 05:24:20 -06:00 by GiteaMirror · 5 comments

Originally created by @guillep2k on GitHub (Aug 10, 2019).

  • Gitea version (or commit ref): release/v1.9
  • Git version: 2.22.0
  • Operating system: Linux - CentOS 7
  • Database (use [x]):
    • [x] PostgreSQL
    • [ ] MySQL
    • [ ] MSSQL
    • [ ] SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • [ ] Yes (provide example URL)
    • [x] No
    • [ ] Not relevant
  • Log gist:

Description

When using the repository indexer, files with multi-byte character sets don't get indexed correctly. This happens when characters look like valid UTF-8 code points but are not. Once a bad sequence is encountered, the rest of the file is indexed as a single token; e.g. if the file is 100KB and the bad sequence is in the middle of it, the indexer gets the first half of the file OK, and the rest as one "word" which is 50KB long (and certainly not searchable).

To reproduce this issue, files with the following content can be tested using UTF-8 and Latin1 character sets:

```
sailorvenus
áéíóú
sailormoon
```

Note: to test properly, the files must be committed through git, not Gitea's web interface.

Searching for `sailorvenus` brings results, as it is the first word. In the Latin1-encoded file the rest of the context is garbled.
![image](https://user-images.githubusercontent.com/18600385/62816051-f1ec6280-baf7-11e9-87e9-a6d0c9576d36.png)

Searching for `sailormoon` doesn't bring results from the Latin1-encoded file, as the indexing for the rest of the file is garbled:
![image](https://user-images.githubusercontent.com/18600385/62816059-121c2180-baf8-11e9-8751-18c5628ad930.png)

GiteaMirror added the type/bug label 2025-11-02 05:24:20 -06:00

@guillep2k commented on GitHub (Aug 10, 2019):

I think some kind of encoding fallback could be used, perhaps pre-set in app.ini.


@silverwind commented on GitHub (Aug 10, 2019):

Sounds like a bug in Bleve to me.


@lafriks commented on GitHub (Aug 10, 2019):

I think that, just like we currently detect the encoding and convert to UTF-8 for display, we need to do the same before giving content to Bleve.


@guillep2k commented on GitHub (Aug 10, 2019):

> Sounds like a bug in Bleve to me.

@silverwind, it's more about the way it's being used, but yes, it's not very robust when invalid data is presented to it. This is the current set of filters instantiated in Gitea for the repositories:

```go
const unicodeNormalizeName = "unicodeNormalize"

func addUnicodeNormalizeTokenFilter(m *mapping.IndexMappingImpl) error {
    return m.AddCustomTokenFilter(unicodeNormalizeName, map[string]interface{}{
        "type": unicodenorm.Name,
        "form": unicodenorm.NFC,
    })
}

[...]
    textFieldMapping := bleve.NewTextFieldMapping()
    textFieldMapping.IncludeInAll = false
    docMapping.AddFieldMappingsAt("Content", textFieldMapping)

    mapping := bleve.NewIndexMapping()
    if err = addUnicodeNormalizeTokenFilter(mapping); err != nil {
        return err
    } else if err = mapping.AddCustomAnalyzer(repoIndexerAnalyzer, map[string]interface{}{
        "type":          custom.Name,
        "char_filters":  []string{},
        "tokenizer":     unicode.Name,
        "token_filters": []string{unicodeNormalizeName, lowercase.Name, unique.Name},
    }); err != nil {
        return err
    }
    mapping.DefaultAnalyzer = repoIndexerAnalyzer
    mapping.AddDocumentMapping(repoIndexerDocType, docMapping)
    mapping.AddDocumentMapping("_all", bleve.NewDocumentDisabledMapping())
[...]
```

And then the queue is filled with:

```go
	fileContents, err := git.NewCommand("cat-file", "blob", update.BlobSha).
		RunInDirBytes(repo.RepoPath())
	if err != nil {
		return err
	} else if !base.IsTextFile(fileContents) {
		return nil
	}
	indexerUpdate := indexer.RepoIndexerUpdate{
		Filepath: update.Filename,
		Op:       indexer.RepoIndexerOpUpdate,
		Data: &indexer.RepoIndexerData{
			RepoID:  repo.ID,
			Content: string(fileContents),
		},
	}
	return indexerUpdate.AddToFlushingBatch(batch)
```

The indexer is passed the original data nonchalantly, even if it's binary.

This code was probably copied from the issue indexer, and issue texts are always UTF-8 encoded.

I agree with @lafriks, detecting the encoding is the way to go, but that only goes so far. I'd add a filter to deal with invalid cases, because if one invalid code point gets through, the index gets filled with weird data.

I'll try to look into this in a couple of days. I'm very glad I've finally found the reason my indexes were only partially useful.


@guillep2k commented on GitHub (Aug 16, 2019):

Fixed by #7814

Reference: github-starred/gitea#3758