Mirror of https://github.com/harvard-edge/cs249r_book.git, synced 2026-05-08 02:28:25 -05:00
Ensures consistency between section IDs in QMD files and corresponding references in quiz JSON files. The script now automatically updates these references when section IDs change, maintaining synchronization between content and assessments.
1116 lines
47 KiB
Python
# section_id_manager.py

"""
Comprehensive Section ID Management Script for Quarto/Markdown Book Projects
---------------------------------------------------------------------------

This script provides a complete toolkit for managing section IDs in a Markdown/Quarto book project.
It ensures that all section headers have unique, clean, and consistent section IDs while preserving
cross-references and other attributes. It also automatically updates corresponding quiz JSON files
when section IDs change, keeping content and assessments synchronized.

Special Handling for Unnumbered Headers:
----------------------------------------
- Any header with the {.unnumbered} class is always skipped for section ID management.
- Unnumbered headers never have a section ID added, updated, or required (including in verify mode).
- Only numbered headers (without {.unnumbered}) require section IDs.

Smart Block Detection:
----------------------
- **Code Blocks:** Headers inside code blocks (```...```) are automatically ignored.
- **Div Blocks:** Headers inside Quarto divs (::: {.class} ... :::) are automatically ignored.
- **Callouts:** Headers inside callout divs are automatically ignored.
- **Comments:** R/Python comments like `## Section Name` inside code blocks are not treated as headers.

This prevents the script from mistaking code comments, or headers that appear in documentation
examples, for actual section headers.

Workflow Philosophy:
--------------------
- **Write Freely:**
  - As you write, you can leave out section IDs or make up quick guess IDs (e.g., {#sec-my-section}) for new sections.
  - Focus on content, not on perfecting section IDs.

- **Automated Management:**
  - Before committing or publishing, run this script (manually, via pre-commit, or in CI).
  - The script can:
    - Add missing section IDs (except for unnumbered headers)
    - Repair existing IDs to match the new format
    - Remove all IDs for a fresh start
    - Verify that all IDs are present (unnumbered headers are skipped)
    - List all IDs for reference
    - Create backups before making changes
    - Update cross-references when IDs change

- **Referencing Sections:**
  - While writing, use your best guess for section IDs in cross-references.
  - After running the script, look up the actual IDs with --list.

- **Safety First:**
  - Use --backup to create timestamped backups before making changes.
  - Use --dry-run to preview changes without modifying files.
  - Use --verify to check for issues before committing (unnumbered headers are always ignored).

ID Scheme:
----------
- IDs have the form: sec-{chapter-title}-{section-title}-{hash}
- Chapter and section titles have stopwords removed for cleaner IDs.
- The hash is generated from: file path + chapter title + section title + parent section hierarchy.
- Section content is NOT included in the hash, so IDs remain stable when content changes.
- This ensures GLOBAL UNIQUENESS across the entire book project.
- Different files with identical section names and hierarchies will have different IDs.
- Parent sections are included in the hash to handle duplicate section names naturally.
- The visible part of the ID remains short and human-readable.
- IDs are stable and won't change if sections are reordered (as long as the hierarchy doesn't change).

Stable ID Generation:
---------------------
- Section IDs are based on structural information (file path, chapter title, section title, hierarchy).
- Section content is NOT included in the hash, which keeps IDs stable when content changes.
- Running --repair multiple times will not change IDs unless the structure actually changes.
- This prevents ID churn and keeps cross-references valid when content is modified.

Global Uniqueness Guarantee:
----------------------------
The hash input includes the file path, so sections with identical names and hierarchies
in different files get different IDs. This prevents conflicts when:

- Multiple chapters have sections with the same name (e.g., "Introduction" in different files)
- Different files have identical section hierarchies (e.g., "Techniques > Advanced > Optimization")
- The same section name appears in multiple contexts across the book

Example hash inputs:
- File A: "contents/chapter1.qmd|Getting Started|Introduction"
- File B: "contents/chapter2.qmd|Getting Started|Introduction"
- Result: different 4-character hashes ensure unique IDs

Available Modes:
----------------
- **Add Mode (default):** Add missing section IDs to headers (skips unnumbered headers and code blocks)
- **Repair Mode (--repair):** Fix existing section IDs to match the new format (stable across multiple runs)
- **Remove Mode (--remove):** Remove all section IDs (use with --backup)
- **Verify Mode (--verify):** Check that all section IDs are present; format is NOT checked (skips unnumbered headers and code blocks)
- **List Mode (--list):** Display all section IDs found in files (skips code blocks)

Safety Features:
----------------
- **Backup System:** --backup creates .backup.{timestamp} files before changes
- **Dry Run:** --dry-run shows what would change without modifying files
- **Interactive Prompts:** Asks for confirmation before making changes
- **Force Mode:** --force automatically accepts all confirmations without prompting
- **Attribute Preservation:** Maintains other attributes when modifying section IDs
- **Cross-reference Updates:** Automatically updates references when IDs change
- **Stable IDs:** IDs remain consistent across multiple repair runs

Best Practices:
---------------
- Use --backup when making bulk changes
- Use --verify before commits to check ID presence (unnumbered headers and code blocks are always ignored)
- Use --list to audit existing section IDs
- Use --dry-run to preview changes before applying them
- Consider running this script in pre-commit hooks or CI pipelines
- Run --repair as many times as needed; IDs will remain stable

Key Features:
- Comprehensive section ID management (add, repair, remove, verify, list)
- Hierarchy-based ID generation that reflects document structure
- Natural handling of duplicate section names through parent section context
- Global uniqueness guaranteed by including the file path in the hash
- Stable IDs that don't change when sections are reordered
- Smart attribute preservation (e.g., {.class #sec-id .other-class})
- Cross-reference updating when IDs change
- Backup creation for safety
- Detailed summaries and progress reporting
- Support for both single files (-f) and directories (-d)
- Stopword removal for cleaner, more readable IDs
- **Unnumbered headers are always skipped for section IDs in all modes**
- **Code blocks and divs are automatically detected and skipped**
- **Stable ID generation prevents unnecessary changes**

Code Quality:
-------------
- Shared functions eliminate code duplication
- Consistent block detection logic across all modes
- Modular design with clear separation of concerns
- Comprehensive error handling and validation

Typical Usage:
    # Add missing IDs
    python section_id_manager.py -d contents/
    python section_id_manager.py -f contents/chapter.qmd

    # Repair existing IDs (stable across multiple runs)
    python section_id_manager.py -d contents/ --repair --backup
    python section_id_manager.py -d contents/ --repair --force

    # Verify all IDs (skips unnumbered headers and code blocks)
    python section_id_manager.py -d contents/ --verify

    # List all IDs (skips code blocks)
    python section_id_manager.py -d contents/ --list

    # Remove all IDs (dangerous!)
    python section_id_manager.py -d contents/ --remove --backup

    # Preview changes
    python section_id_manager.py -d contents/ --repair --dry-run

Author: [Your Name]
"""

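# Example of a managed header after processing. The 4-character suffix is the
# runtime-computed hash; "d212" here is just the docstring's illustration:
#
#   ## Introduction {#sec-getting-started-introduction-d212}
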
import argparse
import re
import hashlib
from pathlib import Path
import logging
import difflib
import nltk
from nltk.corpus import stopwords
import sys
import os
import glob
import time
import json

# Download NLTK stopwords if not already downloaded
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout)
    ]
)

# Global variable to track ID replacements
id_replacements = {}

# Shared regex patterns - defined once to avoid duplication
HEADER_PATTERN = re.compile(r'^(#{1,6})\s+(.+?)(?:\s*\{[^}]*\})?$')
DIV_START_PATTERN = re.compile(r'^:::\s*\{\.([^"]+)')
DIV_END_PATTERN = re.compile(r'^:::\s*$')
CODE_BLOCK_PATTERN = re.compile(r'^```[^`]*$')  # Matches a code fence line (start or end)

def initialize_block_tracking():
    """Initialize block tracking state variables."""
    return {
        'inside_skip_div': False,
        'inside_code_block': False
    }

def update_block_state(line, state):
    """
    Update block tracking state based on the current line.

    Args:
        line: The current line being processed
        state: Dictionary with 'inside_skip_div' and 'inside_code_block' keys

    Returns:
        Updated state dictionary
    """
    line_stripped = line.strip()

    # Check for code block boundaries
    if CODE_BLOCK_PATTERN.match(line_stripped):
        state['inside_code_block'] = not state['inside_code_block']
        return state

    # Check for div boundaries
    if DIV_START_PATTERN.match(line_stripped):
        state['inside_skip_div'] = True
    elif DIV_END_PATTERN.match(line_stripped):
        state['inside_skip_div'] = False

    return state

def should_process_header(line, state):
    """
    Determine if a header should be processed based on the current block state.

    Args:
        line: The current line
        state: Block tracking state dictionary

    Returns:
        A (should_process, match) tuple: (True, re.Match) if the header should
        be processed, (False, None) otherwise.
    """
    match = HEADER_PATTERN.match(line)
    if match and not state['inside_skip_div'] and not state['inside_code_block']:
        return True, match
    return False, None

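# Illustrative walkthrough of the block-tracking helpers (shown as comments so
# nothing runs at import time; the Match repr is abbreviated):
#
#   state = initialize_block_tracking()
#   state = update_block_state("```python", state)    # opens a code block
#   should_process_header("## Not a header", state)   # -> (False, None)
#   state = update_block_state("```", state)          # closes the code block
#   should_process_header("## Real Header", state)    # -> (True, <re.Match>)
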
def simple_slugify(text):
    """Convert header text to a slug format, removing stopwords."""
    # Get English stopwords
    stop_words = set(stopwords.words('english'))

    # Convert to lowercase and split into words
    words = text.lower().split()

    # Remove stopwords and non-alphanumeric characters
    filtered_words = []
    for word in words:
        # Remove non-alphanumeric characters
        word = re.sub(r'[^\w\s]', '', word)
        # Skip if word is empty or a stopword
        if word and word not in stop_words:
            filtered_words.append(word)

    # Join with hyphens
    return '-'.join(filtered_words)

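# Example (assumes NLTK's English stopword list, so "the" and "of" drop out):
#   simple_slugify("The Art of Debugging")  -> 'art-debugging'
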
def clean_text_for_id(text):
    """Clean text for use in section IDs."""
    # Convert to lowercase
    text = text.lower()

    # Replace spaces and special characters with hyphens
    text = re.sub(r'[^a-z0-9]+', '-', text)

    # Remove leading/trailing hyphens
    text = text.strip('-')

    # Replace multiple hyphens with single hyphen
    text = re.sub(r'-+', '-', text)

    return text

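# Example:
#   clean_text_for_id("Hello, World! v2.0")  -> 'hello-world-v2-0'
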
def normalize_content_for_hash(content):
    """
    Normalize content for hashing to reduce noise from minor formatting changes.

    This function removes or normalizes formatting that doesn't change the semantic
    meaning of the content, so that minor formatting changes don't cause section IDs
    to change. Note: the current ID generation deliberately ignores section content,
    so this helper is retained for reference.

    Args:
        content: Raw content string from the section

    Returns:
        Normalized content string suitable for hashing
    """
    if not content:
        return ""

    # Remove extra whitespace and normalize
    normalized = re.sub(r'\s+', ' ', content.strip())

    # Remove code blocks and inline code first, before emphasis markers are
    # stripped (stripping the backticks first would leave the code text behind)
    normalized = re.sub(r'```.*?```', '', normalized, flags=re.DOTALL)
    normalized = re.sub(r'`[^`]+`', '', normalized)

    # Convert images to alt text before links, since the link pattern would
    # otherwise consume the bracketed part of an image
    normalized = re.sub(r'!\[([^\]]*)\]\([^)]+\)', r'\1', normalized)  # Images to alt text
    normalized = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', normalized)  # Links to text

    # Remove remaining emphasis markers and stray backticks
    normalized = re.sub(r'[*_`]', '', normalized)

    # Remove HTML tags (basic)
    normalized = re.sub(r'<[^>]+>', '', normalized)

    # Remove blockquote markers (multiline)
    normalized = re.sub(r'^>\s*', '', normalized, flags=re.MULTILINE)

    # Remove list markers (multiline)
    normalized = re.sub(r'^[\s]*[-*+]\s+', '', normalized, flags=re.MULTILINE)
    normalized = re.sub(r'^[\s]*\d+\.\s+', '', normalized, flags=re.MULTILINE)

    # Clean up any remaining extra whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()

    return normalized

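# Example:
#   normalize_content_for_hash("**Bold** text with a [link](https://example.com).")
#   -> 'Bold text with a link.'
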
def normalize_section_id(section_id):
    """Normalize a section ID to ensure consistent format."""
    if not section_id:
        return None

    # Ensure the ID starts with sec-
    if not section_id.startswith('sec-'):
        return None

    # Split into parts
    parts = section_id.split('-')
    if len(parts) < 3:  # Need at least: sec, chapter, section
        return None

    # Clean each part
    cleaned_parts = [clean_text_for_id(part) for part in parts]

    # Rejoin with hyphens
    normalized = '-'.join(cleaned_parts)

    return normalized

def is_properly_formatted_id(section_id, title, file_path, chapter_title, section_counter):
    """Check if a section ID follows the correct format.

    The extra arguments are accepted for call-site symmetry, but only the ID
    itself is inspected.
    """
    # Check if the ID has the required parts
    if not section_id.startswith('sec-'):
        return False, None

    # Split into parts
    parts = section_id.split('-')
    if len(parts) < 4:  # Need at least: sec, chapter, section, hash
        return False, None

    # Check if it has a hash part (4 hex chars)
    if not re.search(r'-[a-f0-9]{4}$', section_id):
        return False, None

    # If it passes all format checks, it's properly formatted
    return True, section_id

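# Examples (the unused arguments are elided):
#   is_properly_formatted_id('sec-intro-overview-a1b2', ...)  -> (True, 'sec-intro-overview-a1b2')
#   is_properly_formatted_id('sec-intro-overview', ...)       -> (False, None)  # no 4-hex-char hash suffix
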
def generate_section_id(title, file_path, chapter_title, section_counter, parent_sections=None, section_content=None):
    """
    Generate a unique section ID based on the section title and hierarchy.

    The hash includes the file path, chapter title, section title, and parent
    section hierarchy to ensure uniqueness across the entire book project.
    Content is NOT included in the hash, so IDs remain stable when content
    changes (e.g., when quizzes are added or removed).

    Args:
        title: The section title
        file_path: The file path (included in hash for location tracking)
        chapter_title: The chapter title
        section_counter: Counter for this section (not used in hash)
        parent_sections: List of parent section titles (included in hash)
        section_content: The content of the section (ignored - not used in hash)

    Returns:
        A unique section ID in the format: sec-{chapter-slug}-{section-slug}-{4-char-hash}

    Example:
        Same section name in different files:
        - File A: "contents/chapter1.qmd|Getting Started|Introduction" → hash: d212
        - File B: "contents/chapter2.qmd|Getting Started|Introduction" → hash: 8435
        Result: Different IDs ensure uniqueness based on location and hierarchy only
    """
    clean_title = simple_slugify(title)
    clean_chapter_title = simple_slugify(chapter_title)

    # Build hierarchy string from parent sections
    hierarchy = ""
    if parent_sections:
        # Create a hierarchy string from all parent sections
        hierarchy_parts = []
        for parent in parent_sections:
            hierarchy_parts.append(simple_slugify(parent))
        hierarchy = "|".join(hierarchy_parts)

    # Hash includes file path, chapter title, section title, and parent hierarchy only.
    # Content is excluded so IDs remain stable when content changes.
    hash_input = f"{file_path}|{chapter_title}|{title}|{hierarchy}".encode('utf-8')
    hash_suffix = hashlib.sha1(hash_input).hexdigest()[:4]  # Keep 4 chars
    return f"sec-{clean_chapter_title}-{clean_title}-{hash_suffix}"

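# Illustrative call (the suffix is the first 4 hex chars of the SHA-1 of
# "file|chapter|title|hierarchy", so the exact value depends on the inputs):
#   generate_section_id("Optimization", "contents/ch1.qmd", "Getting Started", 0,
#                       parent_sections=["Techniques", "Advanced"])
#   -> 'sec-getting-started-optimization-<hash>'
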
def list_section_ids(filepath):
    """List all section IDs found in a single file."""
    logging.info(f"\n📋 Section IDs in: {filepath}")
    logging.info(f"{'='*60}")

    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    state = initialize_block_tracking()
    section_count = 0

    for i, line in enumerate(lines, 1):
        # Update block state
        state = update_block_state(line, state)

        # Check if we should process this header
        should_process, match = should_process_header(line, state)
        if should_process:
            hashes, title = match.groups()
            if len(hashes) > 1:  # Skip chapter title
                section_count += 1
                existing_id_matches = re.findall(r'\{#(sec-[^}]+)\}', line)
                if existing_id_matches:
                    section_id = existing_id_matches[0]
                    logging.info(f"  {section_count:2d}. {title.strip()}")
                    logging.info(f"      ID: #{section_id}")
                else:
                    logging.info(f"  {section_count:2d}. {title.strip()} (NO ID)")

    if section_count == 0:
        logging.info("  No sections found")
    else:
        logging.info(f"\n  Total sections: {section_count}")

def list_all_section_ids(directory):
    """List all section IDs found in all files in a directory."""
    path = Path(directory)
    if not path.exists():
        logging.error(f"Directory does not exist: {directory}")
        return

    all_files = list(path.rglob("*.md")) + list(path.rglob("*.qmd"))
    if not all_files:
        logging.warning(f"No markdown files found in directory: {directory}")
        return

    total_sections = 0
    total_with_ids = 0

    for file_path in all_files:
        with open(file_path, 'r', encoding='utf-8') as file:
            lines = file.readlines()

        state = initialize_block_tracking()
        file_sections = 0
        file_with_ids = 0

        for line in lines:
            # Update block state
            state = update_block_state(line, state)

            # Check if we should process this header
            should_process, match = should_process_header(line, state)
            if should_process:
                hashes, title = match.groups()
                if len(hashes) > 1:  # Skip chapter title
                    file_sections += 1
                    if re.search(r'\{#sec-[^}]+\}', line):
                        file_with_ids += 1

        if file_sections > 0:
            logging.info(f"📄 {file_path}: {file_with_ids}/{file_sections} sections have IDs")
            total_sections += file_sections
            total_with_ids += file_with_ids

    logging.info(f"\n📊 SUMMARY:")
    logging.info(f"  Total files: {len(all_files)}")
    logging.info(f"  Total sections: {total_sections}")
    logging.info(f"  Sections with IDs: {total_with_ids}")
    logging.info(f"  Sections missing IDs: {total_sections - total_with_ids}")

def extract_section_content(lines, section_start_index, header_level):
    """
    Extract the content of a section from the markdown file.

    Args:
        lines: List of lines in the file
        section_start_index: Index of the section header line
        header_level: Level of the section header (2-6)

    Returns:
        String containing the section content (normalized)
    """
    content_lines = []
    i = section_start_index + 1
    state = initialize_block_tracking()  # Track code/div blocks

    while i < len(lines):
        line = lines[i]
        state = update_block_state(line, state)
        line_stripped = line.strip()

        # Only treat a header as the section end if not inside a code or div block
        if not state['inside_code_block'] and not state['inside_skip_div']:
            if line_stripped.startswith('#'):
                next_header_level = len(line_stripped) - len(line_stripped.lstrip('#'))
                if next_header_level <= header_level:
                    break

        # Div and code block boundaries are handled by the block tracking above.

        # For all lines (including headers), strip attributes after '{'
        if '{' in line_stripped:
            line_stripped = line_stripped[:line_stripped.find('{')].strip()

        # Add non-empty lines to content
        if line_stripped:
            content_lines.append(line_stripped)

        i += 1

    return ' '.join(content_lines)

def create_backup(file_path):
    """Create a backup of the file before making changes."""
    backup_path = f"{file_path}.backup.{int(time.time())}"
    import shutil
    shutil.copy2(file_path, backup_path)
    logging.info(f"💾 Created backup: {backup_path}")
    return backup_path

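# Example backup name for "contents/intro.qmd" (the suffix is the Unix
# timestamp at run time, so the value below is illustrative):
#   contents/intro.qmd.backup.1715145600
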
def process_markdown_file(file_path, auto_yes=False, force=False, dry_run=False, repair_mode=False, remove_mode=False, backup_mode=False):
    """Process a single Markdown file."""
    global id_replacements
    logging.info(f"\n📄 Processing: {file_path}")

    # Create backup if requested
    if backup_mode and not dry_run:
        create_backup(file_path)

    path = Path(file_path)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)

    state = initialize_block_tracking()
    modified = False
    section_counter = 0
    chapter_title = None
    existing_sections = []

    # Track section hierarchy
    section_hierarchy = []  # Stack of parent sections

    file_summary = {
        'file_path': file_path,
        'added_ids': [],
        'updated_ids': [],
        'removed_ids': [],
        'existing_sections': [],
        'modified': False
    }

    # Find chapter title (tracking block state so a '#' line inside an early
    # code block is not mistaken for the chapter heading)
    for line in lines:
        state = update_block_state(line, state)
        should_process, match = should_process_header(line, state)
        if should_process and len(match.group(1)) == 1:
            chapter_title = match.group(2).strip()
            break

    if not chapter_title:
        raise ValueError(f"No chapter title found in {file_path}")

    # Reset state for main processing
    state = initialize_block_tracking()

    for i, line in enumerate(lines):
        # Update block state
        state = update_block_state(line, state)

        # Check if we should process this header
        should_process, match = should_process_header(line, state)
        if should_process:
            hashes, title = match.groups()
            header_level = len(hashes)

            if header_level > 1:  # Skip chapter title (level 1)
                # Update section hierarchy based on header level
                while len(section_hierarchy) >= header_level - 1:
                    section_hierarchy.pop()

                # Add current section to hierarchy (will be used for next section)
                section_hierarchy.append(title.strip())

                # Get parent sections for current section (exclude the current section itself)
                parent_sections = section_hierarchy[:-1] if len(section_hierarchy) > 1 else []

                # Extract existing attributes if any
                existing_attrs = ""
                if "{" in line:
                    attrs_start = line.find("{")
                    attrs_end = line.rfind("}")
                    if attrs_end > attrs_start:
                        existing_attrs = line[attrs_start:attrs_end+1]

                # Skip headers with {.unnumbered}
                if ".unnumbered" in existing_attrs:
                    # Remove any existing section ID from unnumbered headers
                    existing_id_matches = re.findall(r'\{#(sec-[^}]+)\}', line)
                    if existing_id_matches:
                        existing_id = existing_id_matches[0]
                        # Remove the section ID while preserving other attributes
                        new_attrs = re.sub(r'#sec-[^}\s]+', '', existing_attrs)
                        # Remove duplicate .unnumbered
                        new_attrs = re.sub(r'(\.unnumbered)(?=.*\.unnumbered)', '', new_attrs)
                        # Remove extra whitespace
                        new_attrs = re.sub(r'\s+', ' ', new_attrs).strip()
                        # Remove empty braces or braces with only whitespace
                        if new_attrs in ["{}", "{ }", ""]:
                            new_line = f"{hashes} {title}\n"
                        else:
                            new_line = f"{hashes} {title} {new_attrs}\n"
                        lines[i] = new_line
                        modified = True
                        file_summary['modified'] = True
                        file_summary['removed_ids'].append((title.strip(), existing_id))
                        logging.info(f"  🗑️ Removed ID from unnumbered header: {title}")
                        logging.info(f"    {line.strip()}")
                        logging.info(f"    → {new_line.strip()}")
                    continue  # Skip this header

                existing_id_matches = re.findall(r'\{#(sec-[^}]+)\}', line)
                if existing_id_matches:
                    existing_id = existing_id_matches[0]
                    existing_sections.append((title.strip(), existing_id))
                    file_summary['existing_sections'].append((title.strip(), existing_id))

                    if remove_mode:
                        # Remove the section ID
                        if auto_yes or force or input(f"\n🗑️ Remove ID for '{title}': {existing_id}? [Y/n]: ").lower() != 'n':
                            # Remove only the sec- part while preserving other attributes
                            new_attrs = re.sub(r'#sec-[^}\s]+', '', existing_attrs)
                            # Clean up any double spaces or empty braces
                            new_attrs = re.sub(r'\s+', ' ', new_attrs).strip()
                            if new_attrs == "{}":
                                new_line = f"{hashes} {title}\n"
                            else:
                                new_line = f"{hashes} {title} {new_attrs}\n"
                            lines[i] = new_line
                            modified = True
                            file_summary['modified'] = True
                            file_summary['removed_ids'].append((title.strip(), existing_id))
                            logging.info(f"  🗑️ Removed: {title}")
                            logging.info(f"    {line.strip()}")
                            logging.info(f"    → {new_line.strip()}")
                    else:
                        # Extract section content (accepted by generate_section_id but ignored in the hash)
                        section_content = extract_section_content(lines, i, header_level)

                        # Generate the new ID in the standard format with parent hierarchy
                        new_id = generate_section_id(title, file_path, chapter_title, section_counter, parent_sections, section_content)
                        section_counter += 1

                        # Check if the existing ID needs to be repaired/updated
                        is_proper, expected_id = is_properly_formatted_id(existing_id, title, file_path, chapter_title, section_counter)

                        # In repair mode, always update to the new format.
                        # In normal mode, only update if the format is improper.
                        should_update = repair_mode or not is_proper

                        if should_update:
                            if existing_id == new_id:
                                continue  # No change needed, skip
                            if auto_yes or force or input(f"\n🔄 Update ID for '{title}':\n From: {existing_id}\n To: {new_id}\n Proceed? [Y/n]: ").lower() != 'n':
                                # Store the replacement
                                id_replacements[existing_id] = new_id
                                # Replace only the sec- part while preserving other attributes.
                                # This handles cases like: {.class #sec-old-id .other-class}
                                new_attrs = re.sub(r'#sec-[^}\s]+', f'#{new_id}', existing_attrs)
                                new_line = f"{hashes} {title} {new_attrs}\n"
                                lines[i] = new_line
                                modified = True
                                file_summary['modified'] = True
                                file_summary['updated_ids'].append((title.strip(), existing_id, new_id))
                                logging.info(f"  ✓ Updated: {title}")
                                logging.info(f"    {line.strip()}")
                                logging.info(f"    → {new_line.strip()}")
                else:
                    if not remove_mode:  # Only add IDs if not in remove mode
                        # Extract section content (accepted by generate_section_id but ignored in the hash)
                        section_content = extract_section_content(lines, i, header_level)

                        # Generate the new ID in the standard format with parent hierarchy
                        new_id = generate_section_id(title, file_path, chapter_title, section_counter, parent_sections, section_content)
                        section_counter += 1
                        # Add ID while preserving other attributes
                        if existing_attrs:
                            # Remove any existing ID if present
                            attrs_without_id = re.sub(r'#sec-[^}]+', '', existing_attrs)
                            attrs_without_id = attrs_without_id.strip()
                            if attrs_without_id == "{}":
                                new_line = f"{hashes} {title} {{#{new_id}}}\n"
                            else:
                                new_line = f"{hashes} {title} {attrs_without_id} {{#{new_id}}}\n"
                        else:
                            new_line = f"{hashes} {title} {{#{new_id}}}\n"
                        lines[i] = new_line
                        modified = True
                        file_summary['modified'] = True
                        file_summary['added_ids'].append((title.strip(), new_id))
                        logging.info(f"  + Added: {title}")
                        logging.info(f"    {line.strip()}")
                        logging.info(f"    → {new_line.strip()}")

    # Show existing sections even if no changes were made
    if existing_sections:
        logging.info(f"  📋 Existing sections:")
        for title, section_id in existing_sections:
            logging.info(f"    • {title} → #{section_id}")

    if modified and not dry_run:
        path.write_text(''.join(lines), encoding="utf-8")
        logging.info(f"✅ Saved changes to {file_path}")
    elif not modified:
        logging.info(f"✓ No changes needed for {file_path}")

    return file_summary

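# Illustrative before/after for a header with extra attributes (IDs shown are
# examples; the real 4-char suffix is computed at run time):
#   ## Training {.column-page #sec-stale-id}
#   -> ## Training {.column-page #sec-chapter-training-ab12}
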
def process_directory(directory, auto_yes=False, force=False, dry_run=False, repair_mode=False, remove_mode=False, backup_mode=False):
    """
    Recursively process all Markdown and Quarto files in a directory.
    """
    path = Path(directory)
    if not path.exists():
        logging.error(f"Directory does not exist: {directory}")
        return

    all_files = list(path.rglob("*.md")) + list(path.rglob("*.qmd"))
    if not all_files:
        logging.warning(f"No markdown files found in directory: {directory}")
        return

    logging.info(f"\n{'='*60}")
    logging.info(f"🔍 PROCESSING DIRECTORY: {directory}")
    logging.info(f"📁 Found {len(all_files)} files to process")
    logging.info(f"{'='*60}")

    # Collect summaries from all files
    all_summaries = []
    for i, file_path in enumerate(all_files, 1):
        logging.info(f"\n📄 [{i}/{len(all_files)}] Processing: {file_path}")
        logging.info(f"{'-'*60}")

        file_summary = process_markdown_file(file_path, auto_yes=auto_yes, force=force, dry_run=dry_run, repair_mode=repair_mode, remove_mode=remove_mode, backup_mode=backup_mode)
        all_summaries.append(file_summary)

        # Add a separator between files
        if i < len(all_files):
            logging.info(f"{'-'*60}")

    # Print overall summary
    print_summary(all_summaries)

def print_summary(all_summaries):
    """Print a comprehensive summary of all changes made across files."""
    total_files = len(all_summaries)
    files_modified = sum(1 for summary in all_summaries if summary['modified'])
    total_added = sum(len(summary['added_ids']) for summary in all_summaries)
    total_updated = sum(len(summary['updated_ids']) for summary in all_summaries)
    total_removed = sum(len(summary['removed_ids']) for summary in all_summaries)
    total_existing = sum(len(summary['existing_sections']) for summary in all_summaries)

    logging.info(f"\n{'='*60}")
    logging.info(f"📊 FINAL SUMMARY")
    logging.info(f"{'='*60}")
    logging.info(f"📁 Files processed: {total_files}")
    logging.info(f"✅ Files modified: {files_modified}")
    logging.info(f"➕ Section IDs added: {total_added}")
    logging.info(f"🔄 Section IDs updated: {total_updated}")
    logging.info(f"🗑️ Section IDs removed: {total_removed}")
    logging.info(f"📋 Existing sections found: {total_existing}")

    if total_added > 0 or total_updated > 0 or total_removed > 0:
        logging.info(f"\n FILES WITH CHANGES:")
        logging.info(f"{'-'*60}")

        for summary in all_summaries:
            if summary['added_ids'] or summary['updated_ids'] or summary['removed_ids']:
                logging.info(f"\n📄 {summary['file_path']}:")

                if summary['added_ids']:
                    logging.info(f"  ➕ Added {len(summary['added_ids'])} section IDs")

                if summary['updated_ids']:
                    logging.info(f"  🔄 Updated {len(summary['updated_ids'])} section IDs")
                    # Show the first few updates as examples
                    for i, (title, old_id, new_id) in enumerate(summary['updated_ids'][:3]):
                        logging.info(f"    • {title}: {old_id} → {new_id}")
                    if len(summary['updated_ids']) > 3:
                        logging.info(f"    ... and {len(summary['updated_ids']) - 3} more")

                if summary['removed_ids']:
                    logging.info(f"  🗑️ Removed {len(summary['removed_ids'])} section IDs")

    logging.info(f"\n{'='*60}")

def verify_section_ids(filepath):
    """Verify that all headers have section IDs (presence only), skipping unnumbered headers."""
    missing_ids = []
    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    state = initialize_block_tracking()
    for i, line in enumerate(lines, 1):
        # Update block state
        state = update_block_state(line, state)

        # Check if we should process this header
        should_process, match = should_process_header(line, state)
        if should_process:
            hashes, title = match.groups()
            if len(hashes) > 1:  # Skip chapter title
                # Extract existing attributes if any
                existing_attrs = ""
                if "{" in line:
                    attrs_start = line.find("{")
                    attrs_end = line.rfind("}")
                    if attrs_end > attrs_start:
                        existing_attrs = line[attrs_start:attrs_end+1]

                # Skip headers with {.unnumbered}
                if ".unnumbered" in existing_attrs:
                    continue  # Skip this header

                if not re.search(r'\{#sec-[^}]+\}', line):
                    missing_ids.append({
                        'line': i,
                        'title': title.strip()
                    })

    return missing_ids

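# Return shape (one dict per header that lacks an ID):
#   [{'line': 42, 'title': 'Overview'}, ...]
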
def update_cross_references(file_path, id_map):
    """Update cross-references in a file using the ID mapping."""
    global id_replacements

    logging.info(f"\n🔍 Checking references in: {file_path}")
    path = Path(file_path)

    # Handle JSON files differently
    if path.suffix.lower() == '.json':
        return update_quiz_json(file_path, id_map)

    # Handle text files (QMD, MD, etc.)
    content = path.read_text(encoding="utf-8")

    # Track changes
    changes = []
    modified = False

    # Update each reference
    for old_id, new_id in id_map.items():
        # Update both @ and # references with a word boundary
        pattern = rf'([@#]){re.escape(old_id)}\b'
        new_content = re.sub(pattern, rf'\1{new_id}', content)
        if new_content != content:
            content = new_content
            changes.append((old_id, new_id))
            modified = True

    if modified:
        path.write_text(content, encoding="utf-8")
        logging.info(f"✅ Updated {len(changes)} references:")
        for old, new in changes:
            logging.info(f"  - {old} → {new}")
        return True
    else:
        logging.info(f"  ✓ No references found to update")

    return False

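# Illustrative behaviour of the reference pattern (old ID 'sec-intro-abcd'
# mapped to 'sec-new-ef01'):
#   "@sec-intro-abcd"  -> "@sec-new-ef01"
#   "#sec-intro-abcd"  -> "#sec-new-ef01"
#   "@sec-intro-abcdx" is left alone: 'x' continues the word, so \b fails.
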
def update_quiz_json(file_path, id_map):
    """Update section IDs in a quiz JSON file."""
    global id_replacements

    logging.info(f"\n🔍 Checking quiz JSON in: {file_path}")
    path = Path(file_path)
    try:
        with open(path, 'r', encoding='utf-8') as f:
            quiz_data = json.load(f)
    except json.JSONDecodeError as e:
        logging.error(f"Error decoding JSON from {file_path}: {e}")
        return False

    # Track changes
    changes = []
    modified = False

    # First, update section_id fields in the structure
    for section in quiz_data.get('sections', []):
        old_section_id = section.get('section_id')
        if old_section_id and old_section_id in id_map:
            new_section_id = id_map[old_section_id]
            section['section_id'] = new_section_id
            changes.append((old_section_id, new_section_id))
            modified = True

    # Then, search for any other occurrences of old IDs in the entire JSON
    # content: convert to a string, replace, then parse back
    json_str = json.dumps(quiz_data, indent=2)
    original_json_str = json_str

    for old_id, new_id in id_map.items():
        if old_id in json_str:
            json_str = json_str.replace(old_id, new_id)
            if json_str != original_json_str:
                modified = True
                # Only add to changes if not already added from the section_id field
                if (old_id, new_id) not in changes:
                    changes.append((old_id, new_id))

    if modified:
        # Parse back to JSON to ensure it's valid
        try:
            updated_quiz_data = json.loads(json_str)
            with open(path, 'w', encoding='utf-8') as f:
                json.dump(updated_quiz_data, f, indent=2)
            logging.info(f"✅ Updated {len(changes)} section IDs in {file_path}")
            for old, new in changes:
                logging.info(f"  - {old} → {new}")
            return True
        except json.JSONDecodeError as e:
            logging.error(f"Error after updating JSON in {file_path}: {e}")
            return False
    else:
        logging.info(f"  ✓ No section IDs found to update in {file_path}")

    return False

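# Assumed shape of a *_quizzes.json file (only "sections" and "section_id" are
# read by this script; any other fields, such as "questions", are illustrative):
# {
#   "sections": [
#     {"section_id": "sec-chapter-intro-d212", "questions": ["..."]}
#   ]
# }
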
def main():
    """Main function to process files."""
    global id_replacements
    # Reset id_replacements at the start of each run
    id_replacements = {}

    parser = argparse.ArgumentParser(
        description="Comprehensive Section ID Management for Quarto/Markdown Book Projects",
        epilog="""
Section IDs are critical for cross-referencing and navigation. This tool helps maintain them.

MODE EXAMPLES:

Add missing IDs:
    python section_id_manager.py -d contents/
    python section_id_manager.py -f contents/chapter.qmd

Repair existing IDs:
    python section_id_manager.py -d contents/ --repair
    python section_id_manager.py -f contents/chapter.qmd --repair

Force repair (no prompts):
    python section_id_manager.py -d contents/ --repair --force
    python section_id_manager.py -f contents/chapter.qmd --repair --force

Remove all IDs:
    python section_id_manager.py -d contents/ --remove
    python section_id_manager.py -f contents/chapter.qmd --remove

Verify all IDs:
    python section_id_manager.py -d contents/ --verify
    python section_id_manager.py -f contents/chapter.qmd --verify

List all IDs:
    python section_id_manager.py -d contents/ --list
    python section_id_manager.py -f contents/chapter.qmd --list

Safe repair (with backup):
    python section_id_manager.py -d contents/ --repair --backup
    python section_id_manager.py -f contents/chapter.qmd --repair --backup
""",
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument("-f", "--file", help="Process a single file")
    parser.add_argument("-d", "--directory", help="Process all .qmd files in directory")
    parser.add_argument("-y", "--yes", action="store_true", help="Auto-approve all changes (use with caution)")
    parser.add_argument("--force", action="store_true", help="Force all operations without confirmation prompts")
    parser.add_argument("--dry-run", action="store_true", help="Show changes without writing them")
    parser.add_argument("--verify", action="store_true", help="Verify all section IDs are present (⚠️ does NOT check format)")
    parser.add_argument("--repair", action="store_true", help="Repair existing section IDs to match the new format (preserves other attributes)")
    parser.add_argument("--remove", action="store_true", help="Remove all section IDs (use with --backup for safety)")
    parser.add_argument("--list", action="store_true", help="List all section IDs found in files")
    parser.add_argument("--backup", action="store_true", help="Create backup files before making changes")
    parser.add_argument("--debug", action="store_true", help="Enable debug logging")
    args = parser.parse_args()

    # Configure logging level. basicConfig() was already called at import time,
    # so calling it again would be a no-op; set the level explicitly instead.
    log_level = logging.DEBUG if args.debug else logging.INFO
    logging.getLogger().setLevel(log_level)

    # Validate mode combinations
    mode_count = sum([args.verify, args.repair, args.remove, args.list])
    if mode_count > 1:
        parser.error("Only one mode can be specified: --verify, --repair, --remove, or --list")

    if args.verify:
        logging.warning("⚠️ VERIFY MODE: This only checks if section IDs are present, not if they follow the correct format.")
        logging.warning("   Use --repair to fix IDs that don't match the expected format.")
        if not (args.yes or args.force) and input("Continue with format-agnostic verification? [Y/n]: ").lower() == 'n':
            logging.info("Verification cancelled.")
            sys.exit(0)

        if args.file:
            missing_ids = verify_section_ids(args.file)
            if missing_ids:
                logging.warning(f"❌ {args.file}")
                for header in missing_ids:
                    logging.warning(f"  Line {header['line']}: {header['title']} (missing ID)")
                sys.exit(1)
            else:
                logging.info(f"✅ {args.file}")
                sys.exit(0)
        elif args.directory:
            all_missing = []
            for filepath in glob.glob(os.path.join(args.directory, "**/*.qmd"), recursive=True):
                missing_ids = verify_section_ids(filepath)
                if missing_ids:
                    logging.warning(f"❌ {filepath}")
                    all_missing.append((filepath, missing_ids))
                else:
                    logging.info(f"✅ {filepath}")
            if all_missing:
                # After all files, print details for each file with missing IDs
                for filepath, missing_ids in all_missing:
                    for header in missing_ids:
                        logging.warning(f"  {filepath}: Line {header['line']}: {header['title']} (missing ID)")
                sys.exit(1)
            else:
                sys.exit(0)
        else:
            parser.error("--verify requires either --file or --directory")
    elif args.list:
        if args.file:
            list_section_ids(args.file)
        elif args.directory:
            list_all_section_ids(args.directory)
        else:
            parser.error("--list requires either --file or --directory")
    else:
        # First phase: update all section IDs and build the replacement mapping
        if args.file:
            file_summary = process_markdown_file(args.file, args.yes, args.force, args.dry_run, args.repair, args.remove, args.backup)
            # Print summary for single file
            print_summary([file_summary])

            # Update cross-references in the same directory
            if not args.dry_run and id_replacements:
                logging.info("\n📝 Found the following ID replacements:")
                for old_id, new_id in id_replacements.items():
                    logging.info(f"  {old_id} → {new_id}")

                if args.yes or args.force or input("\n🔄 Would you like to update cross-references with these new IDs? [Y/n]: ").lower() != 'n':
                    logging.info("\n🔍 Searching for cross-references...")
                    file_dir = Path(args.file).parent
                    update_cross_references(args.file, id_replacements)
                    # Also check other files in the same directory
                    for other_file in file_dir.glob("*.qmd"):
                        if other_file != Path(args.file):
                            update_cross_references(str(other_file), id_replacements)

                    # Update all quiz JSON files in the same directory
                    for quiz_file in file_dir.glob("*_quizzes.json"):
                        update_cross_references(str(quiz_file), id_replacements)
        elif args.directory:
            # Process all files with summary
            process_directory(args.directory, args.yes, args.force, args.dry_run, args.repair, args.remove, args.backup)

            # Then update cross-references if we have replacements
            if not args.dry_run and id_replacements:
                logging.info("\n📝 Found the following ID replacements:")
                for old_id, new_id in id_replacements.items():
                    logging.info(f"  {old_id} → {new_id}")

                if args.yes or args.force or input("\n🔄 Would you like to update cross-references with these new IDs? [Y/n]: ").lower() != 'n':
                    logging.info("\n🔍 Searching for cross-references...")
                    # Update all files in the directory
                    for filepath in glob.glob(os.path.join(args.directory, "**/*.qmd"), recursive=True):
                        update_cross_references(filepath, id_replacements)

                    # Update all quiz JSON files in the directory
                    logging.info("\n📝 Updating quiz JSON files...")
                    for quiz_file in glob.glob(os.path.join(args.directory, "**/*_quizzes.json"), recursive=True):
                        update_cross_references(quiz_file, id_replacements)
        else:
            parser.error("Either --file or --directory is required")

if __name__ == "__main__":
    main()