[GH-ISSUE #132] [DRAFT] RFC: Schema.org as the Core Semantic Type System for Flowsint #93
Originally created by @gustavorps on GitHub (Mar 16, 2026).
Original GitHub issue: https://github.com/reconurge/flowsint/issues/132
Abstract
This RFC proposes adopting Schema.org as the foundational semantic vocabulary for Flowsint's type system (flowsint-types). By grounding every entity (Person, Organization, Domain, Email, Phone, WebSite, and others) in Schema.org's well-established, machine-readable vocabulary, Flowsint gains a common language that is interoperable with the broader web, unambiguous for contributors, and extensible without breaking existing contracts. This document explains the motivation, defines a mapping strategy from current Pydantic models to Schema.org types, proposes a concrete implementation path, and addresses known limitations and open questions.
1. Motivation and Problem Statement
Flowsint currently defines its entity types in flowsint-types as standalone Pydantic models: Domain, IP, ASN, CIDR, Individual, Organization, Email, Phone, Website, SocialProfile, Credential, CryptoWallet, Transaction, NFT, and more. These models are purpose-built and work well within the tool's existing scope, but they carry several structural limitations as the project grows:
1.1 Semantic ambiguity. There is no canonical definition of what "Individual" means in relation to "Organization", or how a "Website" differs from a "Domain". New contributors must read source code to understand entity semantics rather than relying on a shared vocabulary. This friction slows onboarding and increases the likelihood of mismatches between enrichers.
1.2 No cross-tool interoperability. OSINT workflows increasingly span multiple tools — Maltego, SpiderFoot, OpenCTI, and others. Flowsint data exported or exchanged with these tools requires ad-hoc translation layers because there is no common ontological footing. A shared vocabulary would let Flowsint speak the same language as any tool that also maps to Schema.org.
1.3 Limited discoverability and graph reasoning. Flowsint uses Neo4j as its graph backend. Without typed, semantically named relationships and entities, graph queries are brittle, and graph-based reasoning tools cannot leverage type hierarchies (e.g., knowing that schema:Person is a subtype of schema:Thing lets a query engine traverse more intelligently).
1.4 Reinventing solved problems. Schema.org is a community-maintained vocabulary, developed through a W3C Community Group, that is used by billions of web pages and a growing number of knowledge graphs. It already defines types for Person, Organization, WebSite, ContactPoint, PostalAddress, and more that directly overlap with Flowsint's entity set. Maintaining parallel definitions duplicates work that the Schema.org community already handles.
2. Proposal Summary
This RFC proposes the following:
- Adopt Schema.org URIs as the canonical @type for every Flowsint entity. Each Pydantic model in flowsint-types will declare a schema_type class variable pointing to its Schema.org equivalent (e.g., "https://schema.org/Person").
- Extend Schema.org where no suitable type exists. Types with no Schema.org equivalent (e.g., ASN, CIDR, CryptoWallet) will be defined as Flowsint-specific extensions under a custom namespace (https://schema.flowsint.io/) following Schema.org's own extension convention.
- Emit JSON-LD context by default from the Flowsint API so that every entity response is semantically typed and directly consumable by any JSON-LD-aware tool.
- Store @type as a node label in Neo4j so that graph traversals benefit from semantic type hierarchies.
- Keep Pydantic models as the internal contract. This is a non-breaking, additive change: Schema.org alignment is expressed through metadata and serialization, not by replacing the model layer.
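The first two points can be illustrated with a small sketch of how a compact type name might resolve to its canonical @type URI. This is only an illustration under stated assumptions: the names PREFIXES, TYPE_MAP, expand, and schema_type_for are hypothetical and not part of flowsint-types, and the mapping is a short excerpt of the one proposed in Section 4.1.

```python
# Namespace prefixes: "schema" is the Schema.org base URI; "flowsint" is the
# extension namespace proposed in this RFC (an assumption until it is hosted).
PREFIXES = {
    "schema": "https://schema.org/",
    "flowsint": "https://schema.flowsint.io/",
}

# Hypothetical excerpt of the Flowsint -> Schema.org mapping from Section 4.1.
TYPE_MAP = {
    "Individual": "schema:Person",
    "Organization": "schema:Organization",
    "IP": "flowsint:IPAddress",
    "ASN": "flowsint:AutonomousSystem",
}


def expand(curie: str) -> str:
    """Expand a compact 'prefix:Name' identifier into a full URI."""
    prefix, _, name = curie.partition(":")
    return PREFIXES[prefix] + name


def schema_type_for(flowsint_type: str) -> str:
    """Return the canonical @type URI for a Flowsint type name."""
    return expand(TYPE_MAP[flowsint_type])


print(schema_type_for("Individual"))  # https://schema.org/Person
print(schema_type_for("IP"))          # https://schema.flowsint.io/IPAddress
```

Keeping the prefix table separate from the type map means a change of extension namespace would touch a single entry.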
3. Schema.org Background
Schema.org is a collaborative, community-driven project founded by Google, Microsoft, Yahoo, and Yandex in 2011 and now maintained under the W3C Schema.org Community Group. It defines a hierarchy of types rooted at schema:Thing, with property definitions that express relationships between types.
Key properties of Schema.org relevant to this proposal:
[…] https://schema.org.
The current Schema.org release is v29.4 (2025-12-08).
4. Type Mapping
The following table maps existing Flowsint types to their Schema.org equivalents. Where no direct equivalent exists, a Flowsint extension type is proposed.
4.1 Core Entity Mappings
| Flowsint type | Schema.org type | Notes |
|---|---|---|
| Individual | schema:Person | […] Person covers name, identifier, affiliation, email, telephone. |
| Organization | schema:Organization | |
| Email | schema:ContactPoint (contactType: email) | […] schema:email) or a ContactPoint. Flowsint's richer Email entity maps cleanly to ContactPoint with contactType = "email". |
| Phone | schema:ContactPoint (contactType: phone) | |
| Website | schema:WebSite | […] url, name, description. Sub-pages are schema:WebPage. |
| Domain | schema:WebSite + Flowsint extension | […] Domain type. Use schema:WebSite for the root representation and flowsint:Domain as a refinement carrying DNS-specific properties. |
| SocialProfile | schema:ProfilePage | […] ProfilePage. Maps to social media profile pages directly. |
| Credential | flowsint:Credential (extension) | […] schema:Thing. |
| IP | flowsint:IPAddress (extension) | […] schema:Thing. |
| ASN | flowsint:AutonomousSystem (extension) | […] schema:Thing. |
| CIDR | flowsint:CIDRBlock (extension) | […] schema:Thing. |
| CryptoWallet | flowsint:CryptoWallet (extension) | […] schema:MoneyAccount as a distant analogy, but a clean extension is preferred for accuracy. |
| Transaction | schema:MoneyTransfer | |
| NFT | flowsint:NFT (extension) | […] schema:CreativeWork given NFTs often represent digital assets. |

4.2 Relationship Mappings
Schema.org also defines relationships (properties) that map to Flowsint's graph edges:
| Flowsint edge | Schema.org property |
|---|---|
| Individual → Organization | schema:memberOf / schema:worksFor |
| Individual → Email | schema:email (via ContactPoint) |
| Individual → Phone | schema:telephone (via ContactPoint) |
| Organization → Domain | schema:url + flowsint:ownsDomain |
| Domain → IP | flowsint:resolvesTo |
| IP → ASN | flowsint:belongsToASN |
| ASN → CIDR | flowsint:announcesPrefix |
| CryptoWallet → Transaction | flowsint:hasTransaction |

5. Proposed Implementation
5.1 New Namespace: flowsint.typing.schema_org
Rather than annotating the existing Pydantic models in flowsint-types directly, a dedicated new sub-package is introduced:
The existing models in flowsint-types are not modified. The schema_org namespace is a parallel layer that wraps or mirrors them, providing Schema.org-aware serialization, validation, and identity. This separation guarantees that enrichers and API routes that depend on the current models continue to function without any changes.
5.1.1 Base Mixin (_base.py)
All Schema.org-typed models share a common mixin that handles JSON-LD serialization and identity:
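The RFC's own listing is not reproduced in this copy, so what follows is only a minimal sketch of what such a mixin might look like. It uses standard-library dataclasses as a stand-in for Pydantic models so that it runs without dependencies. The class name SchemaOrgMixin and the method name to_jsonld() come from this RFC; jsonld_id(), CONTEXT_URL, and the mailto: identity rule are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import ClassVar, Optional

CONTEXT_URL = "https://schema.org"  # JSON-LD @context for Schema.org terms


class SchemaOrgMixin:
    """Adds JSON-LD serialization to a dataclass-based model (sketch only)."""

    schema_type: ClassVar[str] = "https://schema.org/Thing"

    def jsonld_id(self) -> Optional[str]:
        # Hypothetical hook: subclasses return a stable identifier, or None
        # to omit "@id" from the output.
        return None

    def to_jsonld(self) -> dict:
        doc = {"@context": CONTEXT_URL, "@type": self.schema_type}
        node_id = self.jsonld_id()
        if node_id is not None:
            doc["@id"] = node_id
        # ClassVar attributes are not dataclass fields, so schema_type is
        # not emitted twice. Unset (None) properties are skipped.
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                doc[f.name] = value
        return doc


@dataclass
class Person(SchemaOrgMixin):
    schema_type: ClassVar[str] = "https://schema.org/Person"
    name: str
    email: Optional[str] = None
    telephone: Optional[str] = None

    def jsonld_id(self) -> Optional[str]:
        # Assumption: identify a person by e-mail address when one is known.
        return f"mailto:{self.email}" if self.email else None


print(Person(name="Ada", email="ada@example.org").to_jsonld())
```

The field names in this sketch already match Schema.org property names (name, email, telephone), so no renaming step is shown here.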
5.1.2 Example Entity (entities/person.py)
5.1.3 Example Extension (extensions/ip_address.py)
5.1.4 JSON-LD Context (context.jsonld)
The context file is published at a stable, versioned URL and bundled inside the package at flowsint/typing/schema_org/context.jsonld:
5.2 API Integration
The flowsint-api layer gains a thin content-negotiation adapter. No existing route signatures change:
Standard application/json responses are completely unaffected.
5.3 PostgreSQL Migration Script
The following Alembic-compatible migration adds a schema_type column to every entity table and backfills it from the type mapping defined in Section 4.1. This column acts as the durable, queryable semantic tag inside the relational store.
5.4 Neo4j Migration Script
The Cypher migration below runs as a one-off script via cypher-shell or through the Neo4j Python driver. It adds a schemaType property and a secondary Schema.org label to every existing node, and creates an index for efficient type-based traversal.
A Python helper is also provided for running this migration programmatically within the Flowsint startup sequence:
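The helper itself is not reproduced in this copy. The following is a hedged sketch of how one might be structured, not the RFC's actual code. LABEL_MAP, the secondary label names (SchemaPerson and so on), the index names, and the functions build_statements and migrate are all assumptions. The helper accepts any object with a run(query, **params) method, which is the shape of a session from the official Neo4j Python driver, so the sketch itself needs no database to run.

```python
from typing import Protocol

# Hypothetical excerpt: existing node label -> (schemaType URI, secondary label).
LABEL_MAP = {
    "Individual": ("https://schema.org/Person", "SchemaPerson"),
    "Organization": ("https://schema.org/Organization", "SchemaOrganization"),
    "IP": ("https://schema.flowsint.io/IPAddress", "FlowsintIPAddress"),
}


class Runner(Protocol):
    """Anything exposing run(query, **params), e.g. a Neo4j driver session."""

    def run(self, query: str, **params): ...


def build_statements(label_map=LABEL_MAP) -> list:
    """Return (cypher, params) pairs for the migration.

    Labels cannot be passed as Cypher parameters, so they are interpolated
    from the trusted map above; the URI is passed as a parameter.
    """
    statements = []
    for label, (uri, extra_label) in label_map.items():
        # Only touch nodes not yet migrated, so a re-run is harmless.
        statements.append((
            f"MATCH (n:`{label}`) WHERE n.schemaType IS NULL "
            f"SET n.schemaType = $uri, n:`{extra_label}`",
            {"uri": uri},
        ))
        statements.append((
            f"CREATE INDEX schema_type_{label.lower()} IF NOT EXISTS "
            f"FOR (n:`{label}`) ON (n.schemaType)",
            {},
        ))
    return statements


def migrate(session: Runner) -> int:
    """Run every statement on the given session; return how many were run."""
    statements = build_statements()
    for query, params in statements:
        session.run(query, **params)
    return len(statements)
```

With the official driver, a session obtained from GraphDatabase.driver(...).session() could be passed to migrate(); the connection URI and credentials are deployment-specific and are deliberately left out.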
6. Benefits
Interoperability. Any system that understands Schema.org or JSON-LD can consume Flowsint entities without a custom adapter. This makes it straightforward to pipe Flowsint output into SIEM platforms, knowledge graphs, or other OSINT tools.
Contributor clarity. When a new enricher author needs to add a type, they have a canonical reference to check first rather than inventing a schema in isolation. The question "does this entity already exist?" has a definitive answer.
Richer graph queries. Semantic type labels in Neo4j enable queries like "find all entities that are subtypes of schema:Organization", a query that would otherwise require maintaining an explicit type hierarchy in application code.
Future-proofing. Schema.org is actively maintained. As the web evolves, and as OSINT increasingly intersects with structured data on the web, Flowsint's type system evolves with it at no extra cost.
SEO and documentation. If Flowsint ever exposes a public API or documentation site, Schema.org types are directly understood by search engines, improving discoverability of API documentation.
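On the graph-query benefit above: Neo4j labels carry no built-in inheritance, so one way to make subtype queries work is to attach every ancestor type as a label when a node is written. The sketch below assumes that approach. PARENT is a hand-copied excerpt of the public Schema.org hierarchy for the types used in Section 4.1, and lineage is a hypothetical helper name.

```python
# Parent links for Schema.org types referenced in Section 4.1.
# Types missing from the map are treated as roots.
PARENT = {
    "Person": "Thing",
    "Organization": "Thing",
    "CreativeWork": "Thing",
    "WebSite": "CreativeWork",
    "WebPage": "CreativeWork",
    "ProfilePage": "WebPage",
    "Intangible": "Thing",
    "StructuredValue": "Intangible",
    "ContactPoint": "StructuredValue",
    "Action": "Thing",
    "TransferAction": "Action",
    "MoneyTransfer": "TransferAction",
}


def lineage(type_name: str) -> list:
    """Return the type followed by all of its ancestors, most specific first."""
    chain = [type_name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


print(lineage("ProfilePage"))  # ['ProfilePage', 'WebPage', 'CreativeWork', 'Thing']
```

If a node is stored with every label in its lineage, a plain label match on a parent type also returns instances of its subtypes.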
7. Drawbacks and Limitations
7.1 Schema.org is not designed for OSINT. Schema.org's primary audience is web publishers marking up content for search engines. Many OSINT-specific concepts (IP geolocation, ASN routing, credential exposure) are outside its scope and must be extensions. This means Flowsint cannot fully delegate to Schema.org — it must maintain its own namespace for a significant portion of its type system.
7.2 Property naming differences. Schema.org uses camelCase property names (givenName, legalName, foundingDate) that may differ from Flowsint's current snake_case Pydantic fields. Mapping between the two requires care to avoid confusion. The to_jsonld() method must handle this translation explicitly.
7.3 Schema.org evolves. A type present in Schema.org v29 may be renamed, deprecated, or restructured in a future release. Flowsint would need a policy for tracking Schema.org releases and updating mappings accordingly. This is manageable but not zero-cost.
7.4 Adds conceptual overhead for contributors. Contributors unfamiliar with Schema.org, JSON-LD, or RDF concepts may find the new machinery confusing. Good documentation and a clear "quick start for adding a new type" guide would mitigate this.
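The snake_case to camelCase translation described in 7.2 could be handled with a mechanical rule plus an explicit override table for fields whose Schema.org name differs in more than casing. The sketch below assumes that design; to_camel, OVERRIDES, jsonld_key, and rename_keys are hypothetical names, and the phone_number entry is an invented example rather than an actual Flowsint field.

```python
def to_camel(name: str) -> str:
    """Convert a lowercase snake_case name to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


# Hypothetical overrides for fields that do not map by casing alone.
OVERRIDES = {"phone_number": "telephone"}


def jsonld_key(field_name: str) -> str:
    """Return the Schema.org property name for a model field name."""
    return OVERRIDES.get(field_name, to_camel(field_name))


def rename_keys(record: dict) -> dict:
    """Return a copy of the record with keys translated for JSON-LD output."""
    return {jsonld_key(key): value for key, value in record.items()}


print(rename_keys({"given_name": "Ada", "phone_number": "+1 555 0100"}))
# {'givenName': 'Ada', 'telephone': '+1 555 0100'}
```

Keeping the overrides in one table gives reviewers a single place to check each non-obvious mapping against the Schema.org definition.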
8. Alternatives Considered
8.1 STIX/TAXII. The Structured Threat Information eXpression (STIX) standard is the de facto vocabulary in cyber threat intelligence. It covers many of the same entities (IP, Domain, Email, Person, Organization) in a security-native way. STIX was not chosen as the primary mapping for two reasons: (a) it is significantly more verbose and complex than needed for Flowsint's graph-based exploration model, and (b) Schema.org is more broadly understood outside the security community, which Flowsint's journalist and researcher audience will appreciate. A secondary STIX serialization could be offered in a future RFC.
8.2 Wikidata / Linked Open Data. Wikidata's property and type system is extremely expressive but also extremely granular and requires deep familiarity with Q-codes. It is not appropriate as a primary vocabulary for an early-stage project.
8.3 Custom Flowsint ontology (status quo++). Simply documenting the existing types more thoroughly does not solve the interoperability or semantic ambiguity problems. A custom ontology from scratch would replicate work already done by Schema.org for the overlapping types.
9. Migration Path
This change is intended to be non-breaking and incremental:
- […] flowsint.typing.schema_org namespace and SchemaOrgMixin
- […] context.jsonld inside the package and at a stable URL
- […] 0010_add_schema_org_type_column
- […] 0010_add_schema_org_labels.cypher
- […] Accept: application/ld+json content negotiation to the API
- […] flowsint-types to align with Schema.org property names

Phases 1–6 can ship together as a minor version since they are entirely additive: existing models, database schemas, and graph queries are untouched. Phase 7 is a separate, explicitly opt-in migration behind a semver major bump and should only be pursued if the community concludes the naming alignment is worth the breaking change.
10. Open Questions
The following questions are raised for community discussion:
- Should flowsint: extension types also be submitted upstream to Schema.org? Types like IPAddress and AutonomousSystem are general enough to be useful beyond OSINT. Submitting proposals to the W3C Schema.org Community Group is feasible but requires sustained engagement.
- Where should the context.jsonld be hosted? Options include the repository itself (requiring a versioned URL like https://raw.githubusercontent.com/reconurge/flowsint/v1.2.0/schema/context.jsonld), a dedicated domain (https://schema.flowsint.io/), or a GitHub Pages deployment. Each has trade-offs in terms of stability and maintenance.
- Should STIX be offered as a secondary serialization format? Given Flowsint's cybersecurity audience, a to_stix() method alongside to_jsonld() could significantly improve interoperability with professional threat intelligence platforms. This is out of scope for this RFC but recommended as a follow-on.
- How should CryptoWallet relate to Schema.org's financial types? Schema.org has FinancialProduct, BankAccount, and MoneyAccount. None are a clean fit for a cryptographic wallet. A thorough review of Schema.org's financial extension vocabulary is warranted before finalising the extension definition.

11. Reference Implementation
A reference branch demonstrating the proposed changes across flowsint-types (new flowsint.typing.schema_org namespace), flowsint-core (Neo4j migration helper), and flowsint-api (content negotiation middleware) will be linked here once available:
12. References
flowsint-types/
13. Acknowledgements
This RFC was prepared for the Flowsint community. Feedback from maintainers, enricher authors, and OSINT practitioners is actively sought. Please open a GitHub issue referencing RFC-002 to comment, or submit a pull request against this document.
RFC-002 · Draft · 2026-03-16