[GH-ISSUE #11448] feat: enable language configuration for Tika OCR, as paperless-ngx does #54898

Closed
opened 2026-05-05 16:51:34 -05:00 by GiteaMirror · 0 comments

Originally created by @rakehell1986 on GitHub (Mar 9, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/11448

Check Existing Issues

Problem Description

After I integrated Tika with Open WebUI, the text of scanned Chinese documents is all converted to English letters. For example:

`仲裁申请书 申请人:额家的,女,汉族,身份证号码:3332324324525452426 住所:圣诞节啊客服经理撒打` has been converted to `AFH a FA Fig A: REND . 2, DK, AGES: 3332324324525452426 FERT: PINTS RIAN 4 HEE 105 Bs ZSERBA: WWIII SS KEM. KAKA RIM. WA AHBL`

So how can Chinese OCR be enabled in Tika?
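For context, a standalone Tika Server accepts a per-request OCR language through the `X-Tika-OCRLanguage` header, which it passes to Tesseract. The sketch below builds such a request with the Python standard library. It is only an illustration of the Tika side, not how Open WebUI calls Tika: the function names are hypothetical, the URL assumes a Tika Server on its default port 9998 on localhost, and the requested language packs (for example `chi_sim`) are assumed to be installed next to Tika.

```python
import urllib.request


def build_tika_ocr_request(
    data: bytes,
    languages: list[str],
    tika_url: str = "http://localhost:9998/tika",  # assumed local Tika Server
) -> urllib.request.Request:
    """Build a PUT request asking Tika Server to OCR `data` with the given languages."""
    if not languages:
        raise ValueError("at least one Tesseract language code is required")
    headers = {
        # Tesseract expects multiple languages joined with '+', e.g. "chi_sim+eng".
        "X-Tika-OCRLanguage": "+".join(languages),
        # Ask Tika for plain extracted text rather than XHTML.
        "Accept": "text/plain",
    }
    return urllib.request.Request(tika_url, data=data, headers=headers, method="PUT")


def extract_text(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request to Tika and return the extracted text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a running server, something like `extract_text(build_tika_ocr_request(open("scan.pdf", "rb").read(), ["chi_sim", "eng"]))` would request Chinese and English recognition; `scan.pdf` is a placeholder file name.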

### Desired Solution you'd like

paperless-ngx handles this as follows:

# The default language to use for OCR. Set this to the language most of your
# documents are written in.
PAPERLESS_OCR_LANGUAGE=eng+chi_sim

# Additional languages to install for text recognition, separated by a whitespace.
# Note that this is different from PAPERLESS_OCR_LANGUAGE (default=eng), which defines
# the language used for OCR.
# The container installs English, German, Italian, Spanish and French by default.
# See https://packages.debian.org/search?keywords=tesseract-ocr-&searchon=names&suite=buster
# for available languages.
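A language setting such as `eng+chi_sim` only helps if the matching Tesseract data is actually installed, which is why paperless-ngx separates "languages to use" from "languages to install". As a rough sketch of that check, the helpers below split a `+`-joined language value and compare it with the text printed by `tesseract --list-langs`. The function names are hypothetical, and the output is passed in as a string so the example does not depend on Tesseract being present.

```python
def parse_ocr_languages(spec: str) -> list[str]:
    """Split a Tesseract-style language value such as 'eng+chi_sim' into codes."""
    return [code.strip() for code in spec.split("+") if code.strip()]


def missing_languages(spec: str, list_langs_output: str) -> list[str]:
    """Return the requested codes that do not appear in `tesseract --list-langs` output."""
    installed = set()
    for line in list_langs_output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        installed.add(line)
    return [code for code in parse_ocr_languages(spec) if code not in installed]
```

If `missing_languages("eng+chi_sim", output)` returns `["chi_sim"]`, the Chinese pack still has to be installed (on Debian-based images this is the `tesseract-ocr-chi-sim` package) before any language setting can take effect.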


Reference: github-starred/open-webui#54898