Use a more sane tokenizer for source code search #13577

Closed
opened 2025-11-02 10:46:37 -06:00 by GiteaMirror · 4 comments

Originally created by @bsofiato on GitHub (Oct 8, 2024).

Feature Description

As of today, the Elasticsearch code search uses the default analyzer when indexing the source code contents. This analyzer splits text on word boundaries (mostly whitespace and punctuation), but it keeps identifiers joined by a dot together.

I feel this approach is not particularly suitable for source code search. To illustrate the issue, let us consider the code snippet below:

public baz(Foo foo) {
   return foo.bar();
}

It is fair to expect that searching for bar would return the code above. As of today, however, this is not the case: ES will treat foo.bar() as a single token. As such, ES will not match the criterion bar.

I suggest we use the pattern tokenizer instead. It uses regular expressions to separate tokens. By default, it uses any non-word character as a token separator. In such a case, the snippet foo.bar() would yield two tokens -- foo and bar (the second token will match the given criterion).
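To make the difference concrete, here is a minimal Python sketch, not Elasticsearch itself. It assumes plain whitespace splitting as a simplified stand-in for the current behavior and `\W+` (one or more non-word characters) as the separator pattern. The helper names `whitespace_tokens` and `pattern_tokens` are made up for illustration.

```python
import re


def whitespace_tokens(text):
    """Split on runs of whitespace only (simplified stand-in for the current behavior)."""
    return text.split()


def pattern_tokens(text, pattern=r"\W+"):
    """Split on runs of non-word characters and drop empty tokens."""
    return [token for token in re.split(pattern, text) if token]


snippet = "public baz(Foo foo) {\n   return foo.bar();\n}"

# Whitespace splitting keeps "foo.bar();" glued together, so "bar" is never a token.
print("bar" in whitespace_tokens(snippet))

# Splitting on non-word characters separates "foo" and "bar".
print("bar" in pattern_tokens(snippet))
```

With whitespace splitting, a search for bar has no token to match, while the pattern-based split produces bar as a token of its own.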

What do you guys think?

Screenshots

No response

GiteaMirror added the type/proposal label 2025-11-02 10:46:37 -06:00

@bsofiato commented on GitHub (Oct 22, 2024):

Hey guys, an update on this issue.

At my workplace, we run a Gitea instance with about 3K repositories. Our L2 and L3 support teams (about 200 people) rely heavily on the code search feature on Gitea. They asked me if I could make the code search case insensitive.

I was thinking of allowing this kind of search. In that case, case-insensitive matches would be ranked as less relevant than matches where the case also matches.
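A rough sketch of that ranking idea in plain Python, assuming a simple two-level score. The weights and the names `match_score` and `rank` are hypothetical and are not taken from Gitea or from the PR.

```python
def match_score(term, token, exact_weight=2.0, folded_weight=1.0):
    """Score one token against a search term (the weights are arbitrary assumptions)."""
    if token == term:
        return exact_weight      # same letters, same case
    if token.lower() == term.lower():
        return folded_weight     # same letters, different case
    return 0.0


def rank(term, documents):
    """Return (score, name) pairs, best match first, dropping non-matching documents."""
    scored = []
    for name, tokens in documents.items():
        score = max((match_score(term, t) for t in tokens), default=0.0)
        if score > 0:
            scored.append((score, name))
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


documents = {
    "a.java": ["return", "foo", "Bar"],
    "b.java": ["return", "foo", "bar"],
    "c.java": ["return", "baz"],
}

# b.java matches with the same case, so it ranks ahead of a.java; c.java is dropped.
print(rank("bar", documents))
```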

What do you guys think? Do you think it is worthwhile to have this in Gitea's main line? If you guys are cool with it, I'll change PR #32261 to handle it as well.


@lunny commented on GitHub (Oct 22, 2024):

I think it's already case insensitive currently? At least for the bleve engine. Maybe you mean case sensitive? If so, maybe we can have an option or filter for that. It looks like GitHub code search doesn't support case-sensitive search.


@bsofiato commented on GitHub (Oct 22, 2024):

Yeah, as a matter of fact, in bleve's case it is indeed already case insensitive (as the screenshot below shows).

image

However, in the ES backend the content field is not normalized. So for ES (which is our case) it is indeed case sensitive (as shown below).

image

If it is ok with you guys, I'll update #32261 to make the ES content field case insensitive like bleve's. What do you think?

P.S. Another option would be to create another PR just for this fix. But I think that rebuilding the index might be too much for a patch version.
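For reference, one way such an index definition could look, written as a Python dict that mirrors the JSON body of an Elasticsearch create-index request. This is a sketch under assumptions, not what #32261 actually contains: the analyzer name `code_analyzer` and the helper `build_index_settings` are hypothetical. It combines the built-in `pattern` tokenizer with the built-in `lowercase` token filter.

```python
import json


def build_index_settings(field="content", analyzer_name="code_analyzer"):
    """Build a create-index body; the analyzer name is a made-up example."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        # Built-in tokenizer; splits on non-word characters by default.
                        "tokenizer": "pattern",
                        # Built-in filter; lowercases tokens so matching ignores case.
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer_name}
            }
        },
    }


print(json.dumps(build_index_settings(), indent=2))
```

Since the analyzer of an existing text field cannot be changed in place, a change like this means the content has to be reindexed.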


@bsofiato commented on GitHub (Oct 23, 2024):

@lunny I've pushed some changes to #32261 to make ES' search case insensitive :)


Reference: github-starred/gitea#13577