flowsint-api
Installation
- Install Python dependencies:
pip install -r requirements.txt
- Install spaCy models for person recognition:
python install_spacy_models.py
Or manually install the models:
python -m spacy download fr_core_news_md # French model (preferred)
python -m spacy download en_core_web_md # English model (fallback)
Features
Website Crawler (to_crawler.py)
The website crawler scans websites to extract:
- Emails: Email addresses found in the website content
- Phone Numbers: Phone numbers in various formats
- Individuals: Person names using spaCy Named Entity Recognition (NER)
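As a rough illustration of the email and phone extraction described above, the sketch below uses two simplified regular expressions. The patterns and the `extract_contacts` helper are assumptions made for this example; the actual rules in to_crawler.py may differ.

```python
import re

# Simplified, illustrative patterns (assumptions, not the crawler's real rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least six digits/separators, then a closing digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_contacts(text: str) -> dict[str, list[str]]:
    """Return de-duplicated emails and phone-like strings found in text."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    phones = sorted({match.strip() for match in PHONE_RE.findall(text)})
    return {"emails": emails, "phones": phones}


sample = "Contact: jane.doe@example.com or +33 1 23 45 67 89."
result = extract_contacts(sample)
print(result)
```

Loose patterns like these produce false positives (dates or order numbers can look like phone numbers), so a real crawler would likely validate candidates further.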
The crawler:
- Follows internal links within the same domain
- Respects robots.txt and implements delays between requests
- Extracts visible text content from HTML pages
- Creates Neo4j relationships between websites and found entities
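The link-following and robots.txt behaviour above can be sketched with the standard library alone. The helper names `is_internal` and `build_robots` are hypothetical and the robots.txt content is inlined for the example; the crawler's own implementation may be structured differently.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def is_internal(base_url: str, link: str) -> bool:
    """True if link (absolute or relative) resolves to the same host as base_url."""
    target = urlparse(urljoin(base_url, link))
    same_host = target.netloc == urlparse(base_url).netloc
    return target.scheme in ("http", "https") and same_host


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


base = "https://example.com/"
robots = build_robots("User-agent: *\nDisallow: /private/\n")
links = [
    "/about",
    "https://example.com/private/x",
    "https://other.org/",
    "mailto:a@example.com",
]

# Keep only same-domain links that robots.txt permits.
allowed = [
    urljoin(base, link)
    for link in links
    if is_internal(base, link) and robots.can_fetch("*", urljoin(base, link))
]
print(allowed)
```

A polite crawl loop would also pause between fetches, for example with `time.sleep(delay)`, before requesting each allowed URL.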
Person Recognition
The crawler uses spaCy to identify person names in website content:
- Supports both French (fr_core_news_md) and English (en_core_web_md) models
- Automatically falls back to English if French model is not available
- Creates Individual objects with first name, last name, and full name
- Establishes MENTIONS_INDIVIDUAL relationships in Neo4j
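A minimal sketch of the model fallback and name splitting described above. The `Individual` dataclass here is a stand-in with assumed field names, and `load_model`, `to_individual`, and `find_individuals` are hypothetical helpers; the project's real classes may look different.

```python
from dataclasses import dataclass


@dataclass
class Individual:
    # Stand-in for the project's Individual type; field names are assumed.
    first_name: str
    last_name: str
    full_name: str


def load_model(preferred: str = "fr_core_news_md", fallback: str = "en_core_web_md"):
    """Load the preferred spaCy model, falling back if it is not installed."""
    import spacy  # imported lazily so the helpers below work without spaCy

    try:
        return spacy.load(preferred)
    except OSError:  # spaCy raises OSError when a model package is missing
        return spacy.load(fallback)


def to_individual(full_name: str) -> Individual:
    """Naive split: first token is the first name, the rest is the last name."""
    parts = full_name.split()
    first = parts[0] if parts else ""
    return Individual(first, " ".join(parts[1:]), " ".join(parts))


def find_individuals(nlp, text: str) -> list[Individual]:
    # French models label people as "PER", English models as "PERSON".
    names = {ent.text for ent in nlp(text).ents if ent.label_ in ("PER", "PERSON")}
    return [to_individual(name) for name in sorted(names)]


person = to_individual("Marie  Curie")
print(person)
```

With a model installed, usage would be along the lines of `find_individuals(load_model(), page_text)`. The first-token split is a simplification and mishandles compound first names.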
Configuration
The crawler can be configured with:
- max_pages: Maximum number of pages to crawl (default: 50)
- timeout: Request timeout in seconds (default: 30)
- delay: Delay between requests in seconds (default: 1.0)
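One way to model the three options above is a small dataclass carrying the documented defaults. `CrawlerConfig` is a hypothetical name used for illustration; how the crawler actually receives these parameters is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerConfig:
    # Hypothetical container; the defaults mirror the documented values.
    max_pages: int = 50    # maximum number of pages to crawl
    timeout: float = 30    # request timeout in seconds
    delay: float = 1.0     # delay between requests in seconds

    def __post_init__(self):
        # Reject values that would make a crawl meaningless.
        if self.max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        if self.timeout <= 0 or self.delay < 0:
            raise ValueError("timeout must be positive and delay non-negative")


default = CrawlerConfig()
polite = CrawlerConfig(max_pages=10, delay=2.5)
print(default, polite)
```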
Usage
The API provides REST endpoints for various scanning operations. See the individual scanner modules for specific usage examples.