cs249r_book

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-10 15:49:25 -05:00

Author	SHA1	Message	Date
Vijay Janapa Reddi	d2b7ff04dd	Improves figure and table caption formatting Enhances figure and table caption formatting to ensure consistency and readability. - Implements a comprehensive `apply_sentence_case` function that handles technical terms, acronyms, and proper nouns correctly. - Refines the `format_bold_explanation_caption` function to use the improved sentence casing. - Updates the caption update logic to support both figures and tables. - Widens key phrase length for figures.	2025-07-22 18:21:29 -04:00
Vijay Janapa Reddi	7479bf52f7	Improves cross-reference generation with AI explanations Enhances the cross-reference generation script to leverage local LLMs (via Ollama) for generating natural language explanations, offering readers contextual insights into the connections between document sections. Refines prompts and adds retry logic for improved explanation quality. Also adds a command-line option to specify an Ollama model. Updates the cross-reference injection to display a better formatted explanation. Fixes the reference to the cross-reference data file in the config.	2025-07-21 07:38:10 -04:00
Vijay Janapa Reddi	fe579b28d4	Add sophisticated explanation cleanup and optimize model recommendations based on systematic experiments	2025-07-21 20:54:37 -04:00
Vijay Janapa Reddi	f187028ef5	Fix _quarto.yml ordering to process all chapters in correct sequence	2025-07-21 20:48:55 -04:00
Vijay Janapa Reddi	7ddf5d0b1e	Optimize cross-reference generation: switch to llama3.1:8b with flexible 6-12 word explanations	2025-07-21 20:41:30 -04:00
Vijay Janapa Reddi	ad9f9c50ad	Optimize explanation length based on experiments	2025-07-21 19:55:30 -04:00
Vijay Janapa Reddi	8fb5e43f00	✅ Complete Cross-Reference System Ready for Production 🎉 FINAL SYSTEM FEATURES: - Bold directional arrows (→ ← •) with fallback logic - Academic § symbol with auto-generated section numbers - AI-generated natural explanations - Clean academic formatting - Robust error handling and edge cases - Professional textbook design standards 📚 READY FOR MODEL OPTIMIZATION EXPERIMENTS	2025-07-21 19:39:40 -04:00
Vijay Janapa Reddi	af2024a8fb	🚀 Revolutionary Feature: AI-Generated Cross-Reference Explanations ✨ WORLD-CLASS INNOVATION: - Added --explain flag for AI-generated cross-reference explanations - Uses local Ollama + qwen2.5:7b (private, no external API costs) - Generates 8-12 word explanations of WHY sections connect 🛠 TECHNICAL IMPLEMENTATION: - Interactive setup with model detection and installation guidance - Graceful fallback if Ollama not available - Self-contained in cross_refs.py with smart error handling - Adds 'explanation' field to JSON output 📚 EXAMPLE OUTPUT: - 'Understands AI's biological inspiration and neural network basics.' - 'Understands the context for choosing the right ML framework.' - 'Understands neural networks' role in AI and machine learning.' 🎯 RESULTS ACHIEVED: - 13 cross-references with AI explanations generated - 76% average similarity maintained - Student-focused language ('Understands...' tells value) - Professional textbook quality 🚀 USAGE: python3 cross_refs.py -g -m model -o output.json -d contents/ --explain This makes your textbook's cross-reference system truly revolutionary - no other textbook provides contextualized AI explanations for connections	2025-07-21 17:32:46 -04:00
Vijay Janapa Reddi	dd9d1b4ade	Update connection types: Foundation → Background - Changed 'Foundation' to 'Background' for backward references - Keeps 'Preview' for forward references - More intuitive terminology for LLM understanding: - Background = earlier chapters providing context/prerequisites - Preview = later chapters showing applications/extensions - Tested: 36 Background + 24 Preview connections generated successfully	2025-07-21 17:02:15 -04:00
Vijay Janapa Reddi	3d634f374b	Update command line options: -t for threshold, --threshold - Changed threshold option from --similarity-threshold to --threshold - Changed short form from -threshold to -t - Removed -t from --train (training now uses --train only) - Updated documentation examples to use --train instead of -t - More intuitive and concise command line usage	2025-07-21 16:43:09 -04:00
Vijay Janapa Reddi	f941fcaecd	Add short -threshold option for similarity threshold - Added -threshold as short form of --similarity-threshold - Maintains backward compatibility with existing scripts - Makes command line usage more concise	2025-07-21 16:39:14 -04:00
Vijay Janapa Reddi	212bdc76bf	Update cross-references and filters with improved results - Updated cross_refs.json with 96 sections and 74 cross-references - User-tweaked filters.yml (removed content_filters section) - Removed old cross_references.json file - Results: 50% extraction rate, 66.3% similarity	2025-07-21 16:25:36 -04:00
Vijay Janapa Reddi	ed4c96f0a5	Fix pypandoc content loss from ASCII tables - 30% improvement in section extraction	2025-07-21 16:25:12 -04:00
Vijay Janapa Reddi	acadb19572	🚀 Optimize filtering for better cross-reference coverage ✅ Improved Section Extraction: - Updated filters.yml to be less aggressive on substantial content - Removed exclusions for 'overview', 'introduction', 'conclusion' sections - These often contain valuable technical content, not just meta-content - Kept exclusions for truly meta content like 'purpose', 'learning objectives' ✅ Relaxed Content Filters: - Min length: 200 → 150 chars (allow shorter sections) - Max length: 15000 → 20000 chars (allow longer sections) - List ratio: 70% → 80% (allow more list-heavy content) - Code ratio: 80% → 90% (allow more code examples) - Citation ratio: 30% → 40% (allow more referenced content) ✅ Results with Domain-Adapted Model: - Section extraction: 52 → 74 sections (42% improvement) - Cross-references: 30 → 63 references (110% improvement) - File coverage: 8 → 13 files (62% more files connected) - Quality maintained: 65.6% average similarity ✅ Optimal Settings Identified: - Similarity threshold: 0.6 (vs default 0.65) - Max suggestions: 3 per section - Balances quantity and quality effectively This version provides much better coverage while maintaining high-quality cross-references between legitimate technical sections.	2025-07-21 16:08:27 -04:00
Vijay Janapa Reddi	494d6b7b58	🎉 Working version: Fix fake headers + Enhanced filtering system ✅ CRITICAL FIX - Preserve Original Section IDs: - Extract exact {#sec-...} identifiers from raw markdown - Preserve original section titles without modification - Eliminate artificial header reconstruction that created fake sections - No more invalid section IDs like 'sec-introduction-then' - Only process sections with legitimate {#sec-...} identifiers ✅ Enhanced File & Section Filtering: - Added file-level regex filtering to exclude entire files - Simplified section filtering to use only regex patterns - Removed redundant exact/pattern distinction - Support anchored patterns (^purpose$) and flexible patterns (.quiz.) - Updated filters.yml with comprehensive filtering rules ✅ Improved Content Processing: - Preserve original markdown structure during pypandoc cleaning - Match cleaned content with original headers by title similarity - Maintain authoritative section IDs throughout the pipeline - Remove only Quarto artifacts while keeping real headers intact ✅ User Experience Enhancements: - Added --quiet mode to reduce verbose output - Better error handling and validation for YAML configuration - Clear feedback about filtering and exclusions - Comprehensive testing verified all functionality ✅ Results: - Only legitimate cross-references between real sections - Exact section IDs matching original markdown files - High-quality embeddings from cleaned content - Robust filtering system for production use This version successfully addresses fake header generation and implements the complete filtering system as requested. All section IDs and titles are now preserved exactly as written in the original markdown files.	2025-07-21 16:00:25 -04:00
Vijay Janapa Reddi	3cb655c2a6	Improve YAML processing with proper Python library usage ✅ Enhanced YAML handling: - Added PyYAML import validation with clear error messages - Improved error handling for malformed YAML files - Added UTF-8 encoding support for international characters - Added config validation for required sections - Better user feedback for missing/invalid configuration ✅ Configuration updates: - Renamed to filters.yml for cleaner naming - Added comprehensive section filtering rules - Alphabetized exact matches for easier maintenance - Enhanced documentation with library requirements ✅ Testing verified: - YAML loading and validation - Section filtering (meta-content exclusion) - Error handling for malformed files - End-to-end cross-reference generation Addresses proper Python library usage for YAML processing.	2025-07-21 15:25:51 -04:00
Vijay Janapa Reddi	5540fc486e	Rename cross_referencing to cross_refs - Rename directory scripts/cross_referencing/ -> scripts/cross_refs/ - Rename cross_referencing.py -> cross_refs.py - Update JSON output structure to use file/sections/targets hierarchy - Update _quarto.yml path to look in current directory and exit if not found	2025-07-21 11:13:34 -04:00

17 Commits
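
The filtering and selection settings reported in commit acadb19572 (minimum 150 and maximum 20000 characters, list ratio up to 80%, similarity threshold 0.6, at most 3 suggestions per section) can be illustrated with a minimal Python sketch. This is not the code in `cross_refs.py`: the function names, the line-based way of measuring the list ratio, the inclusive threshold comparison, and the `sec-a`-style IDs in the usage note are all assumptions made for illustration. Only the numeric values come from the commit message.

```python
import re

# Numeric values are quoted from the commit message for acadb19572.
# How the real script measures each ratio is not stated there, so the
# line-based heuristic below is an assumption.
MIN_CHARS = 150
MAX_CHARS = 20000
MAX_LIST_RATIO = 0.80
SIMILARITY_THRESHOLD = 0.6
MAX_SUGGESTIONS = 3

# Matches markdown bullet ("-", "*", "+") or numbered ("1.") list items.
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")


def passes_content_filters(text: str) -> bool:
    """Hypothetical filter: keep sections of reasonable length that are not mostly lists."""
    if not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return False
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    list_lines = sum(1 for line in lines if LIST_ITEM.match(line))
    return list_lines / len(lines) <= MAX_LIST_RATIO


def select_cross_refs(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Keep targets at or above the threshold, best first, capped at MAX_SUGGESTIONS.

    `scores` maps a hypothetical target section ID to its similarity score.
    Treating the threshold as inclusive is an assumption.
    """
    kept = [(sec, s) for sec, s in scores.items() if s >= SIMILARITY_THRESHOLD]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:MAX_SUGGESTIONS]
```

For example, calling `select_cross_refs({"sec-a": 0.9, "sec-b": 0.5, "sec-c": 0.7})` keeps `sec-a` and `sec-c` in that order and drops `sec-b`, which falls below the 0.6 threshold. Keeping the limits as module-level constants mirrors the idea of tuning them in one place, which the log suggests was done through `filters.yml` and the `--threshold` option.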