[GH-ISSUE #90] [Request] Extract More From A Webpage #1033

Open
opened 2026-05-03 01:54:33 -05:00 by GiteaMirror · 2 comments

Originally created by @zero77 on GitHub (Dec 3, 2025).
Original GitHub issue: https://github.com/reconurge/flowsint/issues/90

The current "extract text from webpage" option is quite limited and doesn't return much useful text.
Could it be improved to extract more useful text?

[iocsearcher](https://github.com/malicialab/iocsearcher) could be used, as it is very good at extracting useful text from raw HTML and documents.

Example command:
`pip install iocsearcher`

```
curl -s https://example.com | tee page.html
iocsearcher -r -f page.html
```
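For anyone wanting to try the command above from Python, here is a minimal sketch that writes fetched HTML to a temporary file and runs the same CLI invocation on it. The helper names `build_iocsearcher_command` and `run_iocsearcher` are hypothetical, the only flags used are the `-r -f` pair quoted in this issue, and the sketch assumes the CLI prints its results to standard output, which has not been checked here.

```python
import subprocess
import tempfile
from pathlib import Path


def build_iocsearcher_command(html_path) -> list[str]:
    """Return the argv for the command quoted in the issue (hypothetical helper)."""
    return ["iocsearcher", "-r", "-f", str(html_path)]


def run_iocsearcher(html: str) -> str:
    """Save `html` to a temporary page.html and run iocsearcher on it.

    Assumption: results are printed to stdout. If the tool writes to a file
    instead, read that file before the temporary directory is cleaned up.
    """
    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "page.html"
        page.write_text(html, encoding="utf-8")
        result = subprocess.run(
            build_iocsearcher_command(page),
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    return result.stdout
```

Shelling out keeps the sketch tied to the exact command shown in the issue rather than to any assumed library API.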
GiteaMirror added the enhancement label 2026-05-03 01:54:33 -05:00

@dextmorgn commented on GitHub (Dec 3, 2025):

hey @zero77,

Thanks for this suggestion. I've added more fields to extract: title, description, status_code, content, headers, and technologies. These can all be extracted by `domain_to_website`.

[iocsearcher](https://github.com/malicialab/iocsearcher) is a good suggestion for the future.


@zero77 commented on GitHub (Dec 3, 2025):

`domain_to_website` is a good option, but it extracts a lot and can be slow.
Could there be an option to exclude things from being extracted?
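To make the request concrete, here is a rough sketch of what an exclude option could look like. It is not how `domain_to_website` is implemented: `extract_page_info`, its `exclude` parameter, and the `_PageParser` helper are hypothetical names, the parsing uses only the standard library, and technology detection is left out to keep the example short.

```python
from html.parser import HTMLParser

# Field names taken from the comment above; "technologies" is omitted here.
ALL_FIELDS = ("title", "description", "status_code", "content", "headers")


class _PageParser(HTMLParser):
    """Collects the <title>, the meta description, and visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_page_info(html, status_code, headers, exclude=()):
    """Return only the fields that are not listed in `exclude`."""
    unknown = set(exclude) - set(ALL_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    wanted = set(ALL_FIELDS) - set(exclude)

    result = {}
    if "status_code" in wanted:
        result["status_code"] = status_code
    if "headers" in wanted:
        result["headers"] = dict(headers)

    # Skip HTML parsing entirely when no parsed field is requested.
    if wanted & {"title", "description", "content"}:
        parser = _PageParser()
        parser.feed(html)
        parser.close()
        if "title" in wanted:
            result["title"] = parser.title.strip()
        if "description" in wanted:
            result["description"] = parser.description
        if "content" in wanted:
            result["content"] = " ".join(parser.text_parts)
    return result
```

With something like this, a caller who only needs the title and status code could pass `exclude=("content", "headers", "description")` and avoid the slower steps.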

Reference: github-starred/flowsint#1033