Files
flowsint/flowsint-api/README.md
2025-06-24 00:14:28 +02:00

54 lines
1.6 KiB
Markdown

# flowsint-api
## Installation
1. Install Python dependencies:
```bash
pip install -r requirements.txt
```
2. Install spaCy models for person recognition:
```bash
python install_spacy_models.py
```
Or manually install the models:
```bash
python -m pip install fr_core_news_md # French model (preferred)
python -m pip install en_core_web_md # English model (fallback)
```
## Features
### Website Crawler (`to_crawler.py`)
The website crawler scans websites to extract:
- **Emails**: Email addresses found in the website content
- **Phone Numbers**: Phone numbers in various formats
- **Individuals**: Person names using spaCy Named Entity Recognition (NER)
The crawler:
- Follows internal links within the same domain
- Respects robots.txt and implements delays between requests
- Extracts visible text content from HTML pages
- Creates Neo4j relationships between websites and found entities
#### Person Recognition
The crawler uses spaCy to identify person names in website content:
- Supports both French (`fr_core_news_md`) and English (`en_core_web_md`) models
- Automatically falls back to English if French model is not available
- Creates `Individual` objects with first name, last name, and full name
- Establishes `MENTIONS_INDIVIDUAL` relationships in Neo4j
#### Configuration
The crawler can be configured with:
- `max_pages`: Maximum number of pages to crawl (default: 50)
- `timeout`: Request timeout in seconds (default: 30)
- `delay`: Delay between requests in seconds (default: 1.0)
## Usage
The API provides REST endpoints for various scanning operations. See the individual scanner modules for specific usage examples.