github-starred/flowsint

Fork 0

mirror of https://github.com/reconurge/flowsint.git synced 2026-08-02 16:02:37 -05:00

Files

T

History

…

alembic

…

app

…

.gitignore

…

alembic.ini

…

Dockerfile

…

entrypoint.sh

…

poetry.lock

…

pyproject.toml

…

pytest.ini

…

README.md

…

resume.cfg

…

README.md

flowsint-api

Installation

Install Python dependencies:

pip install -r requirements.txt

Install spaCy models for person recognition:

python install_spacy_models.py

Or manually install the models:

python -m pip install fr_core_news_md  # French model (preferred)
python -m pip install en_core_web_md   # English model (fallback)

Features

Website Crawler (`to_crawler.py`)

The website crawler scans websites to extract:

Emails: Email addresses found in the website content
Phone Numbers: Phone numbers in various formats
Individuals: Person names using spaCy Named Entity Recognition (NER)

The crawler:

Follows internal links within the same domain
Respects robots.txt and implements delays between requests
Extracts visible text content from HTML pages
Creates Neo4j relationships between websites and found entities

Person Recognition

The crawler uses spaCy to identify person names in website content:

Supports both French (fr_core_news_md) and English (en_core_web_md) models
Automatically falls back to English if French model is not available
Creates Individual objects with first name, last name, and full name
Establishes MENTIONS_INDIVIDUAL relationships in Neo4j

Configuration

The crawler can be configured with:

max_pages: Maximum number of pages to crawl (default: 50)
timeout: Request timeout in seconds (default: 30)
delay: Delay between requests in seconds (default: 1.0)

Usage

The API provides REST endpoints for various scanning operations. See the individual scanner modules for specific usage examples.

README.md

flowsint-api

Installation

Features

Website Crawler (to_crawler.py)

Person Recognition

Configuration

Usage

Website Crawler (`to_crawler.py`)