feat Unsupported Filter: /JBIG2Decode in PDF Files #774

Closed
opened 2025-11-11 14:30:59 -06:00 by GiteaMirror · 0 comments

Originally created by @Yanyutin753 on GitHub (May 1, 2024).

Is your feature request related to a problem? Please describe.

I encountered a problem when loading PDF files that contain images encoded with the /JBIG2Decode filter. JBIG2 is a compression method for monochrome (black-and-white) images, and it is especially important for scanned documents, which rely on high compression rates.
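
Because the failure depends on the filter name appearing in the file, a quick pre-check can flag affected PDFs before they are loaded. The following is a minimal sketch that assumes a plain byte search is good enough; `mentions_jbig2` and `file_mentions_jbig2` are hypothetical helper names, not part of pypdf or LangChain. It is only a heuristic: a filter name that is hex-escaped or referenced indirectly may be missed, so a `False` result is not a guarantee.

```python
from pathlib import Path

# Filter name as it normally appears in an image stream dictionary.
JBIG2_FILTER = b"/JBIG2Decode"


def mentions_jbig2(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes name the /JBIG2Decode filter.

    Hypothetical helper for illustration. This is a plain substring search,
    not a PDF parser, so it can miss unusual encodings of the name.
    """
    return JBIG2_FILTER in pdf_bytes


def file_mentions_jbig2(path: str) -> bool:
    """Read a file from disk and apply the same heuristic."""
    return mentions_jbig2(Path(path).read_bytes())
```

A caller could use the result to route flagged files to a different loading path, or to skip image extraction for them.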

Specifically, the error that is raised is as follows:

NotImplementedError: unsupported filter /JBIG2Decode
The problem stems from the pypdf library, which does not currently support the /JBIG2Decode filter. When pypdf tries to decode image data that uses this filter, it raises the error above.

Traceback (most recent call last):
  File "/app/backend/apps/rag/main.py", line 663, in store_doc
    data = loader.load()
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 29, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/pdf.py", line 193, in lazy_load
    yield from self.parser.parse(blob)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 125, in parse
    return list(self.lazy_parse(blob))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 96, in lazy_parse
    yield from [
               ^
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 99, in <listcomp>
    + self._extract_images_from_page(page),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 118, in _extract_images_from_page
    np.frombuffer(xObject[obj].get_data(), dtype=np.uint8).reshape(
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 970, in get_data
    decoded.set_data(b_(decode_stream_data(self)))
                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pypdf/filters.py", line 711, in decode_stream_data
    raise NotImplementedError(f"unsupported filter {filter_type}")
NotImplementedError: unsupported filter /JBIG2Decode
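
In the traceback above, the `NotImplementedError` raised by `get_data()` propagates all the way up through `loader.load()`, so a single undecodable image aborts the whole document. One possible guard is to catch the error per image and keep going. This is a sketch under stated assumptions, not the code of pypdf or LangChain: `decode_supported` is a hypothetical helper, and each `get_data` callable stands in for a call such as `xObject[obj].get_data`.

```python
from typing import Callable, Iterable, List, Tuple


def decode_supported(
    streams: Iterable[Tuple[str, Callable[[], bytes]]],
) -> Tuple[List[bytes], List[str]]:
    """Decode every stream that can be decoded and record the rest.

    Hypothetical helper: `streams` yields (name, get_data) pairs, where
    get_data() returns the decoded bytes or raises NotImplementedError
    for an unsupported filter such as /JBIG2Decode.
    """
    decoded: List[bytes] = []
    skipped: List[str] = []
    for name, get_data in streams:
        try:
            decoded.append(get_data())
        except NotImplementedError:
            # Unsupported filter: skip this image instead of failing the load.
            skipped.append(name)
    return decoded, skipped
```

With this shape, the text of the document and any supported images would still be returned, and the list of skipped names can be logged for follow-up.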

Image: https://github.com/open-webui/open-webui/assets/132346501/3509e2ca-f028-4b51-bd2d-48ef63169bb9

Additional context
This issue blocks the processing of many PDFs that use this common compression technique. It would be greatly appreciated if support for the /JBIG2Decode filter could be considered for a future update.

In the meantime, if you could provide any guidance or workarounds to properly load these types of PDFs, it would be very helpful.

Thank you for your time and consideration.

Reference: github-starred/open-webui#774