[GH-ISSUE #90] [Request] Extract More From A Webpage #1033

GiteaMirror · 2026-05-03T01:54:33-05:00

GiteaMirror commented

2026-05-03 01:54:33 -05:00

Originally created by @zero77 on GitHub (Dec 3, 2025).
Original GitHub issue: https://github.com/reconurge/flowsint/issues/90

The current "extract text from webpage" option is quite limited and doesn't return much useful text.
Could it be improved to extract more useful text?

iocsearcher (https://github.com/malicialab/iocsearcher) could be used, as it is very good at extracting useful text from raw HTML and documents.

Example command:
pip install iocsearcher

curl -s https://example.com | tee page.html
iocsearcher -r -f page.html
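
For comparison, here is a rough standard-library sketch of what "extracting visible text from raw HTML" can look like. It is not how iocsearcher works internally and does not use its API; the names `VisibleTextParser` and `html_to_text` are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping tags whose content is not visible text."""

    # Assumption: these are the tags we treat as "not useful text".
    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace and drop empty chunks.
            text = " ".join(data.split())
            if text:
                self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    # page.html is the file written by the curl command above.
    with open("page.html", encoding="utf-8", errors="replace") as fh:
        print(html_to_text(fh.read()))
```

A dedicated tool such as iocsearcher would normally go further than this, so treat the sketch only as a baseline.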

GiteaMirror added the enhancement label 2026-05-03 01:54:33 -05:00

GiteaMirror commented

2026-05-03 01:54:38 -05:00

@dextmorgn commented on GitHub (Dec 3, 2025):

hey @zero77,

Thanks for this suggestion. I've added more fields to extract: title, description, status_code, content, headers, and technologies. These can be extracted by domain_to_website.

iocsearcher is a good suggestion for the future.
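
As a rough illustration of the fields listed above, the sketch below collects title, description, status_code, headers, and content using only the Python standard library. It is not the flowsint implementation of domain_to_website; `MetaParser`, `summarize_page`, and `fetch_summary` are hypothetical names, and technology detection is left out.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class MetaParser(HTMLParser):
    """Pick up <title> and <meta name="description"> from an HTML page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_page(html, status_code, headers):
    """Build a dict of page fields from already-fetched data (no network)."""
    parser = MetaParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description.strip(),
        "status_code": status_code,
        "headers": dict(headers),
        "content": html,
    }


def fetch_summary(url, timeout=10):
    """Fetch a URL and summarize it. urlopen raises HTTPError on 4xx/5xx."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
        return summarize_page(html, resp.status, resp.headers.items())
```

Keeping `summarize_page` separate from the network call makes the parsing step easy to check against a saved page.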

GiteaMirror commented

2026-05-03 01:54:39 -05:00

@zero77 commented on GitHub (Dec 3, 2025):

domain_to_website is a good option, but it extracts a lot and can be slow.
Could there be an option to exclude some fields from extraction?
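
One possible shape for such an option, sketched under the assumption that each field has its own extractor function: excluded fields are never computed, so slow ones cost nothing when skipped. The `exclude` parameter, `extract_fields`, and `EXTRACTORS` are hypothetical and not part of flowsint.

```python
def extract_fields(page, extractors, exclude=()):
    """Run every extractor whose field name is not in `exclude`.

    `page` is whatever the extractors expect (here: a plain dict).
    `extractors` maps a field name to a function taking `page`.
    """
    excluded = set(exclude)
    unknown = excluded - set(extractors)
    if unknown:
        # Fail loudly on typos instead of silently extracting everything.
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    return {
        name: extractor(page)
        for name, extractor in extractors.items()
        if name not in excluded
    }


# Hypothetical extractors for a few of the fields mentioned in this thread.
EXTRACTORS = {
    "status_code": lambda page: page["status_code"],
    "headers": lambda page: page["headers"],
    "content": lambda page: page["html"],
}

if __name__ == "__main__":
    page = {"status_code": 200, "headers": {}, "html": "<p>hi</p>"}
    print(extract_fields(page, EXTRACTORS, exclude=["content"]))
```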

Reference: github-starred/flowsint#1033