[PR #113] [MERGED] fix(enricher): graceful handling of invalid phone/email in website_to_crawler #1110

Closed
opened 2026-05-03 01:58:50 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/reconurge/flowsint/pull/113
Author: @AlexanderLueftl
Created: 1/28/2026
Status: Merged
Merged: 1/28/2026
Merged by: @dextmorgn

Base: main ← Head: fix/crawler-validation


📝 Commits (1)

  • 12ef177 fix(enricher): handle invalid phone/email validation gracefully in website_to_crawler

📊 Changes

1 file changed (+14 additions, -2 deletions)

📝 flowsint-enrichers/src/flowsint_enrichers/website/to_crawler.py (+14 -2)

📄 Description

Summary

  • Fix crash when reconcrawl extracts phone numbers or emails that fail Pydantic validation
  • Invalid items are now logged as warnings and skipped instead of crashing the entire crawl
  • All valid emails/phones are still collected and graph nodes are created

Problem

When crawling a website, if reconcrawl extracts a phone number or email that fails type validation, the exception bubbles up and:

  1. Crashes the crawl for that website
  2. Loses ALL previously collected emails and phones
  3. Returns an empty result
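
The failure mode above can be reproduced in a minimal sketch. The `Email` class and `crawl_unguarded` function here are hypothetical stand-ins, not the code in `to_crawler.py`; the sketch assumes validation failures surface as `ValueError` (Pydantic's `ValidationError` is a `ValueError` subclass) and that some outer handler turns a failed crawl into an empty result.

```python
from dataclasses import dataclass


@dataclass
class Email:
    """Hypothetical stand-in for the project's validated Email type."""
    email: str

    def __post_init__(self):
        # Deliberately crude check, for illustration only.
        if "@" not in self.email:
            raise ValueError(f"invalid email: {self.email!r}")


def crawl_unguarded(raw_emails):
    """Build Email objects with no per-item error handling (assumed pre-fix shape)."""
    try:
        collected = []
        for raw in raw_emails:
            collected.append(Email(email=raw))  # one bad item raises here
        return collected
    except ValueError:
        # The partially filled list is discarded along with the failed crawl.
        return []


result = crawl_unguarded(["a@example.com", "not-an-email", "b@example.com"])
print(result)  # []
```

The first address validates, but it is lost once the second one raises, which matches the "loses ALL previously collected" symptom described above.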

Solution

Wrap Phone() and Email() construction in individual try-except blocks:

  • Invalid items → logged with Logger.warn() and skipped
  • Valid items → collected normally
  • Crawl continues regardless of individual validation failures
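
A minimal sketch of the per-item guard described above, not the merged diff itself. `Phone`, `Email`, and `collect_valid` are hypothetical stand-ins, the standard-library `logging` module stands in for the project's `Logger.warn()`, and validation failures are assumed to surface as `ValueError`.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("to_crawler_sketch")  # stand-in for the project's Logger


@dataclass
class Email:
    """Hypothetical stand-in for the validated Email type."""
    email: str

    def __post_init__(self):
        if "@" not in self.email:
            raise ValueError(f"invalid email: {self.email!r}")


@dataclass
class Phone:
    """Hypothetical stand-in for the validated Phone type."""
    number: str

    def __post_init__(self):
        digits = self.number.replace("+", "").replace(" ", "").replace("-", "")
        if not digits.isdigit() or len(digits) < 7:
            raise ValueError(f"invalid phone: {self.number!r}")


def collect_valid(raw_items, factory, label):
    """Construct each item in its own try/except so one failure cannot abort the rest."""
    valid = []
    for raw in raw_items:
        try:
            valid.append(factory(raw))
        except ValueError as exc:
            # Log and skip the bad item; everything collected so far is kept.
            logger.warning("Skipping invalid %s %r: %s", label, raw, exc)
    return valid


emails = collect_valid(["a@example.com", "not-an-email", "b@example.com"], Email, "email")
phones = collect_valid(["+1 555-010-9999", "call us"], Phone, "phone")
```

Placing the `try` inside the loop rather than around it is the design choice that matters: the guard scopes each failure to a single item, so the valid addresses on either side of a bad one still reach the result.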

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-03 01:58:50 -05:00
Reference: github-starred/flowsint#1110