[GH-ISSUE #45] TXT Import of SHA1, SHA256, MD5 files as File type

Open
opened 2026-04-11 08:40:49 -05:00 by GiteaMirror · 3 comments

Originally created by @BongoKnight on GitHub (Nov 12, 2025).
Original GitHub issue: https://github.com/reconurge/flowsint/issues/45

It would be great if hashes could be imported as a file or as a new type like hashes. This would open the possibility of adding transforms from VT, MalwareBazaar, and various sandbox solutions. This would help pivot from such data to related domains, URLs, and IPs.

SHA1, SHA256, and MD5 are also sometimes related to certificates. It would be great to have a way to change the data type of all selected nodes in the import window to handle this case as well.

It could be implemented with something like the following in flowsint-core/src/flowsint_core/imports/entity_detection.py:

"""
Entity type detection utilities for import feature.
Provides basic pattern matching for common entity types.
"""

import re
from typing import Optional
import ipaddress


# Ordered registry of detectors. Order matters to avoid false positives
# (e.g., URL should be checked before Domain).
# Each entry is a tuple of (entity_type, predicate_function)
DETECTORS = [
    ("ASN", lambda v: is_asn(v)),
    ("Domain", lambda v: is_domain(v)),
    ("Email", lambda v: is_email(v)),
    ("File", lambda v: is_file(v)),
    ("IP", lambda v: is_ip_address(v)),
    ("Phone", lambda v: is_phone(v)),
    ("Username", lambda v: is_username(v)),
    ("Website", lambda v: is_website(v)),
]


def detect_entity_type(value: str) -> Optional[str]:
    """
    Detect entity type based on value pattern.
    Returns the detected entity type or None if no match.

    Args:
        value: The string value to analyze

    Returns:
        Entity type string (e.g., "Email", "Domain", "IP") or None
    """
    if not value or not isinstance(value, str):
        return None

    value = value.strip()

    # Iterate through ordered detectors to find the first match
    for entity_type, predicate in DETECTORS:
        try:
            if predicate(value):
                return entity_type
        except Exception:
            # If a predicate raises, skip it to avoid breaking detection
            # (predicates should generally be safe and return bool)
            continue

    return None


def is_email(value: str) -> bool:
    """Check if value matches email pattern."""
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(email_pattern, value))


def is_domain(value: str) -> bool:
    """Check if value matches domain pattern."""
    # Basic domain pattern: contains dots and valid characters
    domain_pattern = r'^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$'
    return bool(re.match(domain_pattern, value))

def is_sha1(value: str) -> bool:
    """Check if value matches SHA1 pattern."""
    sha1_pattern = r'^([0-9a-f]{40}|[0-9A-F]{40})$'
    return bool(re.match(sha1_pattern, value))


def is_sha256(value: str) -> bool:
    """Check if value matches SHA256 pattern."""
    sha256_pattern = r'^([0-9a-f]{64}|[0-9A-F]{64})$'
    return bool(re.match(sha256_pattern, value))

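def is_file(value: str) -> bool:
    """Check if value looks like a file hash (SHA1 or SHA256)."""
    # The original called is_sha_256, which is undefined; the resulting
    # NameError was swallowed by detect_entity_type, so SHA256 never matched.
    return is_sha1(value) or is_sha256(value)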

def is_ip_address(value: str) -> bool:
    """Check if value is a valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def is_website(value: str) -> bool:
    """Check if value matches URL/website pattern."""
    url_pattern = r'^https?://'
    return bool(re.match(url_pattern, value, re.IGNORECASE))


def is_phone(value: str) -> bool:
    """Check if value matches phone number pattern."""
    # Remove common separators for checking
    cleaned = re.sub(r'[\s\-\(\)\.]', '', value)
    # Check if it's mostly digits and has reasonable length
    phone_pattern = r'^\+?[0-9]{7,15}$'
    return bool(re.match(phone_pattern, cleaned))



def is_asn(value: str) -> bool:
    """Check if value matches ASN pattern."""
    asn_pattern = r'^AS\d+$'
    return bool(re.match(asn_pattern, value, re.IGNORECASE))


def is_username(value: str) -> bool:
    """Check if value matches username pattern (social media style)."""
    # Matches @username format or simple alphanumeric with underscores
    username_pattern = r'^@?[a-zA-Z0-9_]{3,30}$'
    if re.match(username_pattern, value):
        # Additional check: starts with @ or is not purely numeric
        return value.startswith('@') or not value.lstrip('@').isdigit()
    return False


def get_default_label(entity_type: str, value: str) -> str:
    """
    Get default label for an entity based on its type and value.

    Args:
        entity_type: The detected or selected entity type
        value: The entity value

    Returns:
        Default label string
    """
    # For most types, the value itself is a good label
    type_defaults = {
        "Email": value,
        "Domain": value,
        "IP": value,
        "File": value,
        "Website": value,
        "Phone": value,
        "ASN": value,
        "Username": value.lstrip('@'),  # Remove @ prefix for label
        "Organization": value,
        "Individual": value,
    }

    return type_defaults.get(entity_type, value)
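
The proposal above covers SHA1 and SHA256 but not the MD5 case named in the title. Here is a minimal sketch of how all three could be handled by one helper, assuming only hex-encoded digests need to be recognised. The names `HASH_LENGTHS` and `detect_hash_type` are hypothetical and not part of flowsint:

```python
import re
from typing import Optional

# Hex digest length -> algorithm name.
# Hypothetical helper for illustration, not existing flowsint code.
HASH_LENGTHS = {32: "MD5", 40: "SHA1", 64: "SHA256"}

_HEX_RE = re.compile(r"[0-9a-fA-F]+")


def detect_hash_type(value: str) -> Optional[str]:
    """Return "MD5", "SHA1" or "SHA256" for a hex digest, else None."""
    value = value.strip()
    if not _HEX_RE.fullmatch(value):
        return None
    # Only the length distinguishes the three algorithms.
    return HASH_LENGTHS.get(len(value))


def is_file(value: str) -> bool:
    """Treat any recognised hash as a File entity."""
    return detect_hash_type(value) is not None
```

Unlike the patterns above, this sketch accepts mixed-case hex. Length alone cannot tell MD5 apart from other 128-bit digests (an NTLM hash is also 32 hex characters), so the returned name is a best guess rather than a guarantee.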


@dextmorgn commented on GitHub (Nov 12, 2025):

Hey @BongoKnight,

Thanks for this feedback! If I understand correctly, the idea would be to update the entity_detection.py mechanisms to also detect patterns for SHA1, SHA256, and MD5?

Seems like a good idea. Ideally, there should be a detection pattern for every type of entity available. What was done in entity_detection.py works OK for now, but it is not viable in the long run. I'll try to find a way to create one single detection pattern, but it's definitely a tricky feature.

In the meantime, we can update entity_detection.py to support SHA1, SHA256, and MD5 detection while we think of a better, more robust solution.

Also, I'll add the possibility to "apply" a type to all entries in the import view, instead of having to select it manually for each entity to import.


@BongoKnight commented on GitHub (Nov 13, 2025):

Please let me know if I can help with anything. I'm especially interested in trying to implement transforms for external providers such as VirusTotal and UrlScan, and for knowledge-sharing platforms such as MISP, OpenCTI, TheHive, etc.


@dextmorgn commented on GitHub (Nov 13, 2025):

@BongoKnight I'm absolutely down for some help on this! You could help me define a plan to implement this step by step, or if you're already comfortable with the transform system and Python, you can absolutely submit some pull requests.

Reference: github-starred/flowsint#45