mirror of
https://github.com/reconurge/flowsint.git
synced 2026-05-01 11:48:38 -05:00
# flowsint-api

## Installation

1. Install Python dependencies:
```bash
pip install -r requirements.txt
```

2. Install spaCy models for person recognition:
```bash
python install_spacy_models.py
```

Or manually install the models:
```bash
python -m spacy download fr_core_news_md  # French model (preferred)
python -m spacy download en_core_web_md   # English model (fallback)
```

## Features

### Website Crawler (`to_crawler.py`)

The website crawler scans websites to extract:
- **Emails**: Email addresses found in the website content
- **Phone Numbers**: Phone numbers in various formats
- **Individuals**: Person names using spaCy Named Entity Recognition (NER)
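
As a rough illustration of the email and phone extraction listed above, the sketch below uses regular expressions on already-extracted page text. The patterns and the function names `extract_emails` and `extract_phones` are assumptions made for this example; the patterns in `to_crawler.py` may differ.

```python
import re

# Hypothetical patterns for illustration; the crawler's real patterns may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional "+", then digits separated by spaces, dots,
# dashes, or parentheses.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_emails(text: str) -> list[str]:
    """Return unique, lowercased email addresses in order of first appearance."""
    return list(dict.fromkeys(m.group(0).lower() for m in EMAIL_RE.finditer(text)))


def extract_phones(text: str) -> list[str]:
    """Return unique phone-like strings as digits with an optional leading '+'."""
    phones = []
    for match in PHONE_RE.finditer(text):
        raw = match.group(0)
        digits = re.sub(r"\D", "", raw)
        # Keep only plausible lengths (E.164 numbers have at most 15 digits).
        if 8 <= len(digits) <= 15:
            normalized = ("+" if raw.startswith("+") else "") + digits
            if normalized not in phones:
                phones.append(normalized)
    return phones
```

Normalizing to digits makes it easier to deduplicate the same number written in different formats, at the cost of discarding the original formatting.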

The crawler:
- Follows internal links within the same domain
- Respects robots.txt and implements delays between requests
- Extracts visible text content from HTML pages
- Creates Neo4j relationships between websites and found entities
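
The link-following rules above (same domain, robots.txt, delay between requests) can be sketched with the standard library alone. This is a minimal sketch under assumptions, not the crawler's actual code: the helper names `is_internal`, `build_robots`, and `crawl_order` are hypothetical, and the robots.txt content is passed in as a string so that the example needs no network access.

```python
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def is_internal(base_url: str, link: str) -> bool:
    """True if `link` (possibly relative) resolves to the same host as `base_url`."""
    target = urlparse(urljoin(base_url, link))
    same_host = target.netloc == urlparse(base_url).netloc
    return target.scheme in ("http", "https") and same_host


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_order(base_url, links, robots, user_agent="*", delay=0.0):
    """Yield allowed internal URLs once each, sleeping `delay` seconds after each."""
    seen = set()
    for link in links:
        url = urljoin(base_url, link)
        if url in seen or not is_internal(base_url, link):
            continue
        seen.add(url)
        if robots.can_fetch(user_agent, url):
            yield url
            time.sleep(delay)
```

For example, with `Disallow: /private/` in robots.txt, a link to `/private/data` is skipped while `/about` is yielded.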

#### Person Recognition

The crawler uses spaCy to identify person names in website content:
- Supports both French (`fr_core_news_md`) and English (`en_core_web_md`) models
- Automatically falls back to English if the French model is not available
- Creates `Individual` objects with first name, last name, and full name
- Establishes `MENTIONS_INDIVIDUAL` relationships in Neo4j
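
The fallback and name-splitting behaviour described above might look roughly like the following. This is a sketch under assumptions: the `Individual` dataclass here is a hypothetical stand-in for the project's own type, the helper names are invented for the example, and splitting on the first whitespace is only one possible convention for deriving first and last names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Individual:
    # Hypothetical stand-in for the project's Individual type.
    first_name: str
    last_name: str
    full_name: str


def load_first_available(model_names: Iterable[str], loader: Optional[Callable] = None):
    """Return the first model that loads; spaCy raises OSError for a missing model."""
    if loader is None:
        import spacy  # imported lazily so the helper can be exercised without spaCy
        loader = spacy.load
    for name in model_names:
        try:
            return loader(name)
        except OSError:
            continue
    raise RuntimeError("No spaCy model available")


def to_individual(full_name: str) -> Individual:
    """Split a detected name: the first token is the first name, the rest the last name."""
    parts = full_name.split()
    first = parts[0] if parts else ""
    return Individual(first, " ".join(parts[1:]), " ".join(parts))


def extract_individuals(text: str, nlp) -> list[Individual]:
    """Collect unique person entities from a spaCy pipeline's output."""
    # French models label persons "PER"; English models use "PERSON".
    names = dict.fromkeys(
        ent.text.strip() for ent in nlp(text).ents if ent.label_ in ("PER", "PERSON")
    )
    return [to_individual(name) for name in names if name]
```

With both models installed, `load_first_available(["fr_core_news_md", "en_core_web_md"])` returns the French pipeline; if only the English one is present, it is used instead.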

#### Configuration

The crawler can be configured with:
- `max_pages`: Maximum number of pages to crawl (default: 50)
- `timeout`: Request timeout in seconds (default: 30)
- `delay`: Delay between requests in seconds (default: 1.0)
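
One way to picture these options is as a small settings object. The `CrawlerConfig` class below is hypothetical and exists only for this example; it mirrors the documented names and defaults, while the validation rules are assumptions rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerConfig:
    """Hypothetical container mirroring the documented options and defaults."""

    max_pages: int = 50   # maximum number of pages to crawl
    timeout: float = 30   # request timeout in seconds
    delay: float = 1.0    # pause between requests in seconds

    def __post_init__(self):
        # Assumed sanity checks; the real crawler may validate differently.
        if self.max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        if self.timeout <= 0:
            raise ValueError("timeout must be positive")
        if self.delay < 0:
            raise ValueError("delay cannot be negative")
```

A quick scan of a small site might then use `CrawlerConfig(max_pages=10, delay=0.5)` and leave `timeout` at its default.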

## Usage

The API provides REST endpoints for various scanning operations. See the individual scanner modules for specific usage examples.