[PR #151] feat(enrichers): Arabic media enrichers (Sabq, Argaam, Al Arabiya, Nitter) #2621

Open
opened 2026-06-07 15:05:04 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/reconurge/flowsint/pull/151
Author: @SocialMDev
Created: 6/1/2026
Status: 🔄 Open

Base: main ← Head: feat/arabic-osint-enrichers


📝 Commits (1)

  • adbcef2 feat(enrichers): add Arabic media enrichers (Sabq, Argaam, Al Arabiya, Nitter)

📊 Changes

20 files changed (+1732 additions, -0 deletions)


📝 flowsint-enrichers/pyproject.toml (+1 -0)
flowsint-enrichers/src/flowsint_enrichers/individual/to_alarabiya.py (+134 -0)
flowsint-enrichers/src/flowsint_enrichers/individual/to_arabic_tweets.py (+136 -0)
flowsint-enrichers/src/flowsint_enrichers/individual/to_argaam.py (+134 -0)
flowsint-enrichers/src/flowsint_enrichers/individual/to_sabq.py (+134 -0)
flowsint-enrichers/src/flowsint_enrichers/phrase/__init__.py (+0 -0)
flowsint-enrichers/src/flowsint_enrichers/phrase/to_alarabiya.py (+127 -0)
flowsint-enrichers/src/flowsint_enrichers/phrase/to_arabic_tweets.py (+127 -0)
flowsint-enrichers/src/flowsint_enrichers/phrase/to_argaam.py (+127 -0)
flowsint-enrichers/src/flowsint_enrichers/phrase/to_sabq.py (+127 -0)
flowsint-enrichers/src/tools/arabic_media/__init__.py (+0 -0)
flowsint-enrichers/src/tools/arabic_media/alarabiya.py (+70 -0)
flowsint-enrichers/src/tools/arabic_media/argaam.py (+66 -0)
flowsint-enrichers/src/tools/arabic_media/nitter.py (+125 -0)
flowsint-enrichers/src/tools/arabic_media/sabq.py (+62 -0)
flowsint-enrichers/tests/enrichers/test_arabic_alarabiya.py (+70 -0)
flowsint-enrichers/tests/enrichers/test_arabic_argaam.py (+73 -0)
flowsint-enrichers/tests/enrichers/test_arabic_sabq.py (+133 -0)
flowsint-enrichers/tests/enrichers/test_arabic_tweets.py (+75 -0)
📝 uv.lock (+11 -0)

📄 Description

Summary

Adds 8 enrichers that surface Arabic-language mentions of an Individual or a Phrase (topic) and link them into the graph as Website nodes with source-specific relationship labels.

| Enricher | Input → Output | Relationship | Source |
|---|---|---|---|
| individual_to_sabq / phrase_to_sabq | Individual / Phrase → Website | MENTIONED_IN_SABQ | sabq.org HTML search |
| individual_to_argaam / phrase_to_argaam | Individual / Phrase → Website | MENTIONED_IN_ARGAAM | argaam.com HTML search |
| individual_to_alarabiya / phrase_to_alarabiya | Individual / Phrase → Website | MENTIONED_IN_ALARABIYA | Google News RSS (site:alarabiya.net) |
| individual_to_arabic_tweets / phrase_to_arabic_tweets | Individual / Phrase → Website | MENTIONED_ON_TWITTER_AR | Nitter mirrors → Google dork fallback |
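
For readers unfamiliar with the "Google News RSS (site:alarabiya.net)" source, the sketch below shows one way such a feed URL could be built. The PR description does not show the query AlArabiyaTool actually sends, so the helper name `build_alarabiya_feed_url` and the `hl` / `gl` / `ceid` values are assumptions for illustration only.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the public Google News RSS search endpoint. The exact URL and
# parameters used by AlArabiyaTool are not shown in this PR description.
GOOGLE_NEWS_RSS = "https://news.google.com/rss/search"


def build_alarabiya_feed_url(query: str) -> str:
    """Return an RSS search URL scoped to alarabiya.net (hypothetical helper)."""
    # Quote the query so a multi-word name is matched as a phrase,
    # then restrict results to the target site.
    scoped = f'"{query.strip()}" site:alarabiya.net'
    # Assumed locale parameters requesting Arabic-language, Saudi-region results.
    params = {"q": scoped, "hl": "ar", "gl": "SA", "ceid": "SA:ar"}
    return f"{GOOGLE_NEWS_RSS}?{urlencode(params)}"


feed_url = build_alarabiya_feed_url(" Faisal Aldeghaither ")
# Decode the query string again to confirm the round trip.
decoded_query = parse_qs(urlsplit(feed_url).query)["q"][0]
```

`urlencode` percent-encodes Arabic text as UTF-8, so the same helper would work for Phrase inputs written in Arabic.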

What changed

flowsint-enrichers/
├── pyproject.toml                    +1 dep: defusedxml
├── src/
│   ├── flowsint_enrichers/
│   │   ├── individual/
│   │   │   ├── to_sabq.py            NEW
│   │   │   ├── to_argaam.py          NEW
│   │   │   ├── to_alarabiya.py       NEW
│   │   │   └── to_arabic_tweets.py   NEW
│   │   └── phrase/                   NEW dir (Phrase-input variants)
│   │       ├── __init__.py
│   │       ├── to_sabq.py
│   │       ├── to_argaam.py
│   │       ├── to_alarabiya.py
│   │       └── to_arabic_tweets.py
│   └── tools/arabic_media/           NEW dir
│       ├── __init__.py
│       ├── sabq.py                   SabqTool
│       ├── argaam.py                 ArgaamTool
│       ├── alarabiya.py              AlArabiyaTool (uses defusedxml for RSS)
│       └── nitter.py                 NitterArabicTool
└── tests/enrichers/
    ├── test_arabic_sabq.py
    ├── test_arabic_argaam.py
    ├── test_arabic_alarabiya.py
    └── test_arabic_tweets.py

Why a new phrase/ category

Sabq / Argaam / Al Arabiya all support searching for topics, not just people. Phrase was already in flowsint-types but had no enrichers — this PR adds the first set. Topic search is useful for journalists / OSINT investigators tracking issues rather than individuals.

Security notes

  • defusedxml is used in AlArabiyaTool for parsing Google News RSS, to avoid XXE / billion-laughs attacks on untrusted XML. Added as a dependency (>=0.7,<0.8).
  • All four scrapers respect a 10s timeout and degrade gracefully on non-200 responses.
  • Nitter tool tries each mirror in NITTER_INSTANCES then falls back to a Google dork; tests cover both paths via mocking.
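
The timeout and mirror-fallback behaviour described above could look roughly like the following. This is a minimal sketch, not the code in nitter.py: `fetch_html`, `search_with_fallback`, the `/search?f=tweets&q=` path and the injected `fallback` callable are all assumptions, and the real NITTER_INSTANCES list is not reproduced here.

```python
from http.client import HTTPException
from typing import Callable, Iterable, Optional, Tuple
from urllib.parse import urlencode
from urllib.request import Request, urlopen

TIMEOUT_SECONDS = 10  # the 10s timeout stated in the PR


def fetch_html(url: str) -> Optional[str]:
    """Return the response body on HTTP 200, otherwise None (never raises)."""
    try:
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            if response.status != 200:
                return None
            return response.read().decode("utf-8", errors="replace")
    except (OSError, HTTPException, ValueError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses;
        # ValueError covers malformed URLs.
        return None


def search_with_fallback(
    query: str,
    mirrors: Iterable[str],
    fallback: Callable[[str], str],
    fetch: Callable[[str], Optional[str]] = fetch_html,
) -> Tuple[str, str]:
    """Try each mirror in order; call the fallback only if every mirror fails."""
    for base in mirrors:
        # Assumed search path; a real mirror may expose a different route.
        body = fetch(f"{base}/search?{urlencode({'f': 'tweets', 'q': query})}")
        if body is not None:
            return base, body
    return "fallback", fallback(query)
```

Passing `fetch` as a parameter is what lets tests exercise both the mirror path and the fallback path without touching the network.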

Demo

Brought up docker-compose.dev.yml infra (postgres + redis + neo4j) and ran individual_to_sabq against real Neo4j with HTTP mocked to return 3 fixture article hits for "Faisal Aldeghaither":

MATCH (i:individual)-[r:MENTIONED_IN_SABQ]->(w:website)
RETURN i.`nodeProperties.full_name` AS person, type(r) AS rel, w.`nodeProperties.url` AS url, w.`nodeProperties.title` AS title;
person, rel, url, title
"Faisal Aldeghaither", "MENTIONED_IN_SABQ", "https://sabq.org/news/202605/saudi-vision-update-1", "تحديثات رؤية المملكة 2030"
"Faisal Aldeghaither", "MENTIONED_IN_SABQ", "https://sabq.org/news/202605/saudi-vision-update-2", "مقابلة حصرية"
"Faisal Aldeghaither", "MENTIONED_IN_SABQ", "https://sabq.org/news/202605/saudi-vision-update-3", "تقرير اقتصادي"

The Neo4j Browser visualisation showing the Individual → 3 Website subgraph with MENTIONED_IN_SABQ edges and Arabic article titles is attached in the first PR comment.

Test plan

  • pytest tests/enrichers/test_arabic_*.py — 19 new tests, all green
  • Existing tests/enrichers/test_registry.py still passes (21/21 with new suite)
  • All 8 enrichers register via @flowsint_enricher and appear in ENRICHER_REGISTRY
  • Postprocess writes to real Neo4j (verified with cypher-shell on demo stack)
  • Dedup logic: re-running with duplicate URLs creates each Website + relationship only once
  • No live network calls in tests — SabqTool, ArgaamTool, AlArabiyaTool, NitterArabicTool are mocked
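
To illustrate the last two items, here is a small sketch of URL-level dedup combined with a mocked tool. The PR does not say how dedup is implemented or what the tool methods are called, so `dedupe_by_url`, `collect_mentions` and the `search` method on the mock are hypothetical names.

```python
from unittest.mock import MagicMock


def dedupe_by_url(hits: list) -> list:
    """Keep the first hit for each exact URL, preserving the original order."""
    seen = set()
    unique = []
    for hit in hits:
        if hit["url"] not in seen:
            seen.add(hit["url"])
            unique.append(hit)
    return unique


def collect_mentions(tool, name: str) -> list:
    """Hypothetical scan step: query the tool, then drop duplicate URLs."""
    return dedupe_by_url(tool.search(name))


# Stand-in for SabqTool; the real class and its method names may differ.
fake_tool = MagicMock()
fake_tool.search.return_value = [
    {"url": "https://sabq.org/news/1", "title": "A"},
    {"url": "https://sabq.org/news/1", "title": "A (duplicate)"},
    {"url": "https://sabq.org/news/2", "title": "B"},
]

mentions = collect_mentions(fake_tool, "Faisal Aldeghaither")
```

Because the tool is a `MagicMock`, the test can also assert that it was called exactly once with the expected name, which is one way to confirm that no live request was attempted.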

Notes for maintainer

  • The existing to_domains.py reference enricher served as the architectural template (preprocess / scan / postprocess split, @flowsint_enricher decorator, module-level InputType / OutputType re-export). I tried to match style and structure exactly; happy to adjust if you want different conventions for the new phrase/ category.
  • HTML selectors for sabq.org and argaam.com are defensive (multiple fallbacks via comma-separated CSS selectors) but will need maintenance if those sites change their markup.
  • Discord-friendly: I can open follow-ups to add similar enrichers for other Arabic media (Asharq, Okaz, Riyadh Daily) if the pattern lands well.
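
For reviewers who have not read to_domains.py, the preprocess / scan / postprocess split and decorator-based registration mentioned in the first note can be pictured as below. The registry and decorator here are stand-ins that only borrow the names ENRICHER_REGISTRY and @flowsint_enricher; they are not the real flowsint API, and the class body is an assumption rather than the PR's code.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in registry and decorator, NOT the real flowsint implementations.
ENRICHER_REGISTRY: Dict[str, type] = {}


def flowsint_enricher(cls: type) -> type:
    """Register the class under its `name` attribute and return it unchanged."""
    ENRICHER_REGISTRY[cls.name] = cls
    return cls


@flowsint_enricher
class PhraseToSabq:
    name = "phrase_to_sabq"
    relationship = "MENTIONED_IN_SABQ"

    def __init__(self, search: Callable[[str], List[dict]]):
        self.search = search  # injected so the sketch needs no network

    def preprocess(self, raw: list) -> List[str]:
        """Normalise mixed inputs to non-empty phrase strings."""
        phrases = []
        for item in raw:
            text = item.get("text", "") if isinstance(item, dict) else str(item)
            if text.strip():
                phrases.append(text.strip())
        return phrases

    def scan(self, phrases: List[str]) -> List[dict]:
        """Query the source once per phrase and tag each hit with its phrase."""
        return [{"phrase": p, **hit} for p in phrases for hit in self.search(p)]

    def postprocess(self, results: List[dict]) -> List[Tuple[str, str, str]]:
        """Build (phrase, relationship, url) edges; a real enricher would
        write these to the graph instead of returning them."""
        return [(r["phrase"], self.relationship, r["url"]) for r in results]


enricher = ENRICHER_REGISTRY["phrase_to_sabq"](
    lambda phrase: [{"url": "https://sabq.org/news/1"}]
)
edges = enricher.postprocess(
    enricher.scan(enricher.preprocess([{"text": " Vision 2030 "}, ""]))
)
```

Keeping the three stages separate means only `scan` touches the external source, so the other two can be tested with plain data.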

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-06-07 15:05:04 -05:00
Reference: github-starred/flowsint#2621