Commit Graph

17 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
d2b7ff04dd Improves figure and table caption formatting
Enhances figure and table caption formatting to ensure consistency and readability.

- Implements a comprehensive `apply_sentence_case` function that handles technical terms, acronyms, and proper nouns correctly.
- Refines the `format_bold_explanation_caption` function to use the improved sentence casing.
- Updates the caption update logic to support both figures and tables.
- Widens key phrase length for figures.
2025-07-22 18:21:29 -04:00
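The sentence casing described in commit d2b7ff04dd can be illustrated with a minimal sketch. The body of the real `apply_sentence_case` is not shown in this log, so the logic below is an assumption, and `PROTECTED_TERMS` is a hypothetical stand-in for whatever list of technical terms and proper nouns the script actually keeps.

```python
# Hypothetical list of terms whose casing must survive; the real list is not shown in the log.
PROTECTED_TERMS = {"GPU", "TinyML", "PyTorch", "TensorFlow"}


def apply_sentence_case(text: str) -> str:
    """Lowercase every word except the first, acronyms, and protected terms."""
    result = []
    for index, word in enumerate(text.split()):
        core = word.strip(".,;:()")  # ignore surrounding punctuation when checking
        if core in PROTECTED_TERMS or (len(core) > 1 and core.isupper()):
            result.append(word)  # keep acronyms and known terms untouched
        elif index == 0:
            result.append(word[:1].upper() + word[1:].lower())
        else:
            result.append(word.lower())
    return " ".join(result)
```

Treating any all-uppercase word of two or more letters as an acronym is a simple heuristic; a caption typed entirely in capitals would pass through unchanged, which a production version would need to handle.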
Vijay Janapa Reddi
7479bf52f7 Improves cross-reference generation with AI explanations
Enhances the cross-reference generation script to leverage local LLMs
(via Ollama) for generating natural language explanations, offering readers
contextual insights into the connections between document sections.

Refines prompts and adds retry logic for improved explanation quality.
Also adds a command-line option to specify an Ollama model.

Updates the cross-reference injection to display a better-formatted explanation.

Fixes the reference to the cross-reference data file in the config.
2025-07-22 07:38:10 -04:00
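The retry logic mentioned in commit 7479bf52f7 might look roughly like the sketch below. It assumes a local Ollama server on its default port and uses the `/api/generate` endpoint; the function names, the attempt count, and the choice to validate by word count (borrowing the 6-12 word range from commit 7ddf5d0b1e) are illustrative assumptions, not the script's actual code.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def ask_ollama(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


def explain_with_retry(
    prompt: str,
    generate: Callable[[str], str] = ask_ollama,
    max_attempts: int = 3,   # hypothetical default
    min_words: int = 6,
    max_words: int = 12,
    delay: float = 0.0,      # seconds to wait between attempts
) -> Optional[str]:
    """Retry until the model returns an explanation of acceptable length."""
    for _ in range(max_attempts):
        try:
            text = generate(prompt).strip()
        except OSError:  # covers connection failures raised by urllib
            text = ""
        if min_words <= len(text.split()) <= max_words:
            return text
        time.sleep(delay)
    return None  # caller decides how to fall back
```

Passing the generator as a parameter keeps the retry loop testable without a running model.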
Vijay Janapa Reddi
fe579b28d4 Add sophisticated explanation cleanup and optimize model recommendations based on systematic experiments 2025-07-21 20:54:37 -04:00
Vijay Janapa Reddi
f187028ef5 Fix _quarto.yml ordering to process all chapters in correct sequence 2025-07-21 20:48:55 -04:00
Vijay Janapa Reddi
7ddf5d0b1e Optimize cross-reference generation: switch to llama3.1:8b with flexible 6-12 word explanations 2025-07-21 20:41:30 -04:00
Vijay Janapa Reddi
ad9f9c50ad Optimize explanation length based on experiments 2025-07-21 19:55:30 -04:00
Vijay Janapa Reddi
8fb5e43f00 Complete Cross-Reference System Ready for Production
🎉 FINAL SYSTEM FEATURES:
- Bold directional arrows (→ ← •) with fallback logic
- Academic § symbol with auto-generated section numbers
- AI-generated natural explanations
- Clean academic formatting
- Robust error handling and edge cases
- Professional textbook design standards

📚 READY FOR MODEL OPTIMIZATION EXPERIMENTS
2025-07-21 19:39:40 -04:00
Vijay Janapa Reddi
af2024a8fb 🚀 Revolutionary Feature: AI-Generated Cross-Reference Explanations
WORLD-CLASS INNOVATION:
- Added --explain flag for AI-generated cross-reference explanations
- Uses local Ollama + qwen2.5:7b (private, no external API costs)
- Generates 8-12 word explanations of WHY sections connect

🛠 TECHNICAL IMPLEMENTATION:
- Interactive setup with model detection and installation guidance
- Graceful fallback if Ollama not available
- Self-contained in cross_refs.py with smart error handling
- Adds 'explanation' field to JSON output

📚 EXAMPLE OUTPUT:
- 'Understands AI's biological inspiration and neural network basics.'
- 'Understands the context for choosing the right ML framework.'
- 'Understands neural networks' role in AI and machine learning.'

🎯 RESULTS ACHIEVED:
- 13 cross-references with AI explanations generated
- 76% average similarity maintained
- Student-focused language ('Understands...' tells value)
- Professional textbook quality

🚀 USAGE:
python3 cross_refs.py -g -m model -o output.json -d contents/ --explain

This makes your textbook's cross-reference system truly revolutionary -
no other textbook provides contextualized AI explanations for connections.
2025-07-21 17:32:46 -04:00
Vijay Janapa Reddi
dd9d1b4ade Update connection types: Foundation → Background
- Changed 'Foundation' to 'Background' for backward references
- Keeps 'Preview' for forward references
- More intuitive terminology for LLM understanding:
  - Background = earlier chapters providing context/prerequisites
  - Preview = later chapters showing applications/extensions
- Tested: 36 Background + 24 Preview connections generated successfully
2025-07-21 17:02:15 -04:00
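The Background/Preview rule in commit dd9d1b4ade depends only on chapter order, which a short sketch can make concrete. The chapter names in `CHAPTER_ORDER` are hypothetical, and raising an error for same-chapter links is an assumption, since the commit describes only backward and forward references.

```python
# Hypothetical chapter sequence; the real order comes from _quarto.yml.
CHAPTER_ORDER = ["introduction", "ml_systems", "dl_primer", "frameworks"]


def connection_type(source: str, target: str, order=CHAPTER_ORDER) -> str:
    """Label a cross-reference by whether its target comes before or after the source."""
    source_pos, target_pos = order.index(source), order.index(target)
    if source_pos == target_pos:
        raise ValueError("source and target are the same chapter")
    # Earlier chapters supply context; later chapters show applications.
    return "Background" if target_pos < source_pos else "Preview"
```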
Vijay Janapa Reddi
3d634f374b Update command line options: -t for threshold, --threshold
- Changed threshold option from --similarity-threshold to --threshold
- Changed short form from -threshold to -t
- Removed -t from --train (training now uses --train only)
- Updated documentation examples to use --train instead of -t
- More intuitive and concise command line usage
2025-07-21 16:43:09 -04:00
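After commit 3d634f374b, the option layout could be declared with `argparse` along these lines. This is a sketch only: the 0.65 default is taken from the "default 0.65" remark in commit acadb19572, and treating `--train` as a boolean switch is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with -t/--threshold and a long-form-only --train option."""
    parser = argparse.ArgumentParser(prog="cross_refs.py")
    parser.add_argument(
        "-t", "--threshold",
        type=float,
        default=0.65,  # assumed default
        help="minimum similarity for a cross-reference",
    )
    # --train has no short form, so -t is unambiguous.
    parser.add_argument("--train", action="store_true", help="run training")
    return parser
```

With this layout, `build_parser().parse_args(["-t", "0.6"])` sets `threshold` to `0.6` and leaves `train` as `False`.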
Vijay Janapa Reddi
f941fcaecd Add short -threshold option for similarity threshold
- Added -threshold as short form of --similarity-threshold
- Maintains backward compatibility with existing scripts
- Makes command line usage more concise
2025-07-21 16:39:14 -04:00
Vijay Janapa Reddi
212bdc76bf Update cross-references and filters with improved results
- Updated cross_refs.json with 96 sections and 74 cross-references
- User-tweaked filters.yml (removed content_filters section)
- Removed old cross_references.json file
- Results: 50% extraction rate, 66.3% similarity
2025-07-21 16:25:36 -04:00
Vijay Janapa Reddi
ed4c96f0a5 Fix pypandoc content loss from ASCII tables - 30% improvement in section extraction 2025-07-21 16:25:12 -04:00
Vijay Janapa Reddi
acadb19572 🚀 Optimize filtering for better cross-reference coverage
Improved Section Extraction:
- Updated filters.yml to be less aggressive on substantial content
- Removed exclusions for 'overview', 'introduction', 'conclusion' sections
- These often contain valuable technical content, not just meta-content
- Kept exclusions for truly meta content like 'purpose', 'learning objectives'

Relaxed Content Filters:
- Min length: 200 → 150 chars (allow shorter sections)
- Max length: 15000 → 20000 chars (allow longer sections)
- List ratio: 70% → 80% (allow more list-heavy content)
- Code ratio: 80% → 90% (allow more code examples)
- Citation ratio: 30% → 40% (allow more referenced content)

Results with Domain-Adapted Model:
- Section extraction: 52 → 74 sections (42% improvement)
- Cross-references: 30 → 63 references (110% improvement)
- File coverage: 8 → 13 files (62% more files connected)
- Quality maintained: 65.6% average similarity

Optimal Settings Identified:
- Similarity threshold: 0.6 (vs default 0.65)
- Max suggestions: 3 per section
- Balances quantity and quality effectively

This version provides much better coverage while maintaining high-quality
cross-references between legitimate technical sections.
2025-07-21 16:08:27 -04:00
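The relaxed content filters in commit acadb19572 can be sketched as a single predicate over a section's text. The length limits and the 80% list ratio come from the commit message; how the script actually measures a "list ratio" is not stated, so the per-line definition below is an assumption, and `ContentLimits` and `keep_section` are hypothetical names. The code and citation ratios would follow the same pattern and are omitted because their definitions are not given.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentLimits:
    min_chars: int = 150          # was 200
    max_chars: int = 20000        # was 15000
    max_list_ratio: float = 0.80  # was 0.70


def keep_section(text: str, limits: ContentLimits = ContentLimits()) -> bool:
    """Return True when a section is substantial enough to embed."""
    if not (limits.min_chars <= len(text) <= limits.max_chars):
        return False
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    # Assumed definition: share of non-blank lines that start a list item.
    list_lines = sum(1 for line in lines if re.match(r"\s*([-*+]|\d+\.)\s+", line))
    return list_lines / len(lines) <= limits.max_list_ratio
```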
Vijay Janapa Reddi
494d6b7b58 🎉 Working version: Fix fake headers + Enhanced filtering system
CRITICAL FIX - Preserve Original Section IDs:
- Extract exact {#sec-...} identifiers from raw markdown
- Preserve original section titles without modification
- Eliminate artificial header reconstruction that created fake sections
- No more invalid section IDs like 'sec-introduction-then'
- Only process sections with legitimate {#sec-...} identifiers

Enhanced File & Section Filtering:
- Added file-level regex filtering to exclude entire files
- Simplified section filtering to use only regex patterns
- Removed redundant exact/pattern distinction
- Support anchored patterns (^purpose$) and flexible patterns (.*quiz.*)
- Updated filters.yml with comprehensive filtering rules

Improved Content Processing:
- Preserve original markdown structure during pypandoc cleaning
- Match cleaned content with original headers by title similarity
- Maintain authoritative section IDs throughout the pipeline
- Remove only Quarto artifacts while keeping real headers intact

User Experience Enhancements:
- Added --quiet mode to reduce verbose output
- Better error handling and validation for YAML configuration
- Clear feedback about filtering and exclusions
- Comprehensive testing verified all functionality

Results:
- Only legitimate cross-references between real sections
- Exact section IDs matching original markdown files
- High-quality embeddings from cleaned content
- Robust filtering system for production use

This version successfully addresses fake header generation and implements
the complete filtering system as requested. All section IDs and titles
are now preserved exactly as written in the original markdown files.
2025-07-21 16:00:25 -04:00
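Two ideas from commit 494d6b7b58, reading `{#sec-...}` identifiers straight from the raw markdown and excluding sections by regex, can be sketched together. The regular expression and function names are assumptions rather than the script's own, and matching titles case-insensitively is likewise a guess; the two example patterns are the ones quoted in the commit.

```python
import re

# Matches a markdown header that carries an explicit {#sec-...} identifier,
# optionally followed by other attributes such as .unnumbered.
HEADER_RE = re.compile(
    r"^(#{1,6})\s+(.*?)\s*\{#(sec-[\w-]+)[^}]*\}\s*$", re.MULTILINE
)


def extract_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (section_id, title) pairs exactly as written in the source."""
    return [(m.group(3), m.group(2)) for m in HEADER_RE.finditer(markdown)]


def is_excluded(title: str, patterns: list[str]) -> bool:
    """True when any exclusion regex matches the section title."""
    return any(re.search(p, title, flags=re.IGNORECASE) for p in patterns)


def usable_sections(markdown: str, patterns: list[str]) -> list[tuple[str, str]]:
    """Keep only sections with a real identifier and a non-excluded title."""
    return [
        (sec_id, title)
        for sec_id, title in extract_sections(markdown)
        if not is_excluded(title, patterns)
    ]
```

Headers without an identifier never match `HEADER_RE`, so no section ID is ever synthesized from a title.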
Vijay Janapa Reddi
3cb655c2a6 Improve YAML processing with proper Python library usage
Enhanced YAML handling:
- Added PyYAML import validation with clear error messages
- Improved error handling for malformed YAML files
- Added UTF-8 encoding support for international characters
- Added config validation for required sections
- Better user feedback for missing/invalid configuration

Configuration updates:
- Renamed to filters.yml for cleaner naming
- Added comprehensive section filtering rules
- Alphabetized exact matches for easier maintenance
- Enhanced documentation with library requirements

Testing verified:
- YAML loading and validation
- Section filtering (meta-content exclusion)
- Error handling for malformed files
- End-to-end cross-reference generation

Addresses proper Python library usage for YAML processing.
2025-07-21 15:25:51 -04:00
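The YAML handling listed in commit 3cb655c2a6 (import validation, UTF-8 reading, malformed-file errors, required-section checks) might be arranged as in the sketch below. The key names in `REQUIRED_SECTIONS` are hypothetical, since the log does not show the layout of `filters.yml`, and the error messages are illustrative.

```python
import sys

try:
    import yaml  # provided by the PyYAML package
except ImportError:
    yaml = None  # reported with a clear message when loading is attempted

# Hypothetical key names; the real filters.yml layout is not shown in the log.
REQUIRED_SECTIONS = ("exclude_files", "exclude_sections")


def validate_config(config):
    """Check that the parsed configuration has the expected top-level sections."""
    if not isinstance(config, dict):
        raise ValueError("filters.yml must contain a mapping at the top level")
    missing = [key for key in REQUIRED_SECTIONS if key not in config]
    if missing:
        raise ValueError("filters.yml is missing required sections: " + ", ".join(missing))
    return config


def load_filters(path: str = "filters.yml"):
    """Load and validate the filter configuration, exiting with a clear message on failure."""
    if yaml is None:
        sys.exit("PyYAML is required: pip install pyyaml")
    try:
        with open(path, encoding="utf-8") as handle:  # UTF-8 for international characters
            config = yaml.safe_load(handle)
    except FileNotFoundError:
        sys.exit(f"Configuration file not found: {path}")
    except yaml.YAMLError as exc:
        sys.exit(f"Malformed YAML in {path}: {exc}")
    return validate_config(config)
```

Separating `validate_config` from file access lets the validation rules be exercised on plain dictionaries.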
Vijay Janapa Reddi
5540fc486e Rename cross_referencing to cross_refs
- Rename directory scripts/cross_referencing/ -> scripts/cross_refs/
- Rename cross_referencing.py -> cross_refs.py
- Update JSON output structure to use file/sections/targets hierarchy
- Update _quarto.yml path to look in current directory and exit if not found
2025-07-21 11:13:34 -04:00