cs249r_book

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-23 07:23:03 -05:00

Author	SHA1	Message	Date
Vijay Janapa Reddi	cae8d061c9	feat: Comprehensive caption quality improvements ✨ Enhanced LLM prompt: - Added explicit instructions to avoid weak sentence starters - Discourage 'Illustrates', 'Shows', 'Demonstrates' etc. - Encourage direct, strong language with examples 🔧 Post-processing improvements: - Fix capitalization after periods (handle abbreviations) - Replace weak sentence starters with direct language - Ensure proper table format with ':' prefix - Comprehensive caption validation pipeline 📝 Quality enforcement: - Automatic detection and correction of weak language - Proper sentence case throughout explanations - Standardized table caption format: ': Bold: explanation' - Word-by-word improvements while preserving meaning ✅ Fully tested with edge cases and validation	2025-07-23 11:21:35 -04:00
Vijay Janapa Reddi	dd49365d29	fix: Preserve and standardize colon prefix for table captions - Preserve existing ':' prefix in old format table captions - Add ':' prefix to new format table captions for consistency - Standardize all table captions to ': Caption {#tbl-id}' format - Tested with both old and new caption formats	2025-07-23 11:08:14 -04:00
Vijay Janapa Reddi	2ade2df23f	fix: Ensure proper line breaks after table captions during updates - Add line break preservation logic to table caption replacement - Handle problematic case where content is stuck to caption line - Force line break insertion between caption and following content - Update TikZ figure caption replacement to preserve line breaks - Tested with problematic cases to ensure proper formatting	2025-07-23 10:50:12 -04:00
Vijay Janapa Reddi	6be54008b3	feat: Add retry logic and improve sentence case formatting - Implement 3-retry logic with exponential backoff (2s, 4s, 8s) - Smart retry only for recoverable errors (API/network, not content issues) - Enhanced sentence case formatting with comprehensive technical term preservation - Preserve spaces and punctuation correctly during caption formatting - Support for both fast models (qwen2.5:7b) and large models (gemma3:27b) - Robust error handling for production caption improvement workflow	2025-07-22 21:06:54 -04:00
Vijay Janapa Reddi	67348987b4	Remove llm_experiments directory	2025-07-22 18:46:33 -04:00
Vijay Janapa Reddi	d2b7ff04dd	Improves figure and table caption formatting Enhances figure and table caption formatting to ensure consistency and readability. - Implements a comprehensive `apply_sentence_case` function that handles technical terms, acronyms, and proper nouns correctly. - Refines the `format_bold_explanation_caption` function to use the improved sentence casing. - Updates the caption update logic to support both figures and tables. - Widens key phrase length for figures.	2025-07-22 18:21:29 -04:00
Vijay Janapa Reddi	c6f974c4c1	Improves figure caption generation with LLM Enhances the figure caption improvement process by directly updating the QMD files immediately after generating improved captions with the LLM. This approach streamlines the workflow and reduces the need for a separate update step. It also improves the prompt given to the LLM to better guide caption generation.	2025-07-22 17:49:40 -04:00
Vijay Janapa Reddi	a6c6f091ae	Improves figure captioning and extraction stats Refines the guidelines for generating figure and table captions to enhance their clarity and pedagogical value. Also, enhances the reporting of extraction failures by tracking specific IDs and files with issues for easier debugging.	2025-07-22 17:40:47 -04:00
Vijay Janapa Reddi	ae98521f59	Adds instruction to keep source in caption Adds a guideline to the figure caption writing instructions to preserve source information when it exists.	2025-07-22 17:32:50 -04:00
Vijay Janapa Reddi	b6b055412b	Enhances figure caption generation with Ollama Improves the prompt used for generating figure captions with the Ollama model to focus on educational value and clarity. The updated prompt provides clearer instructions and formatting guidelines for generating captions that teach students about the concepts illustrated in figures and tables. It emphasizes the use of key phrases and concise explanations tailored to the context of an AI/ML systems textbook. Additionally, refactors the figure caption extraction logic to handle nested brackets in captions and escaped characters in paths. This fixes issues with figures containing links and special characters. Stores original captions as-is for better comparison.	2025-07-22 17:30:50 -04:00
Vijay Janapa Reddi	963519aff7	Enhances caption improvement workflow Improves the caption improvement process by implementing a complete in-memory workflow, enhancing context extraction with pypandoc, and refining the LLM prompt for better caption generation. It also adds checks for Ollama and the model before proceeding and simplifies command-line arguments. - Introduces a complete in-memory workflow: Build content map → Improve captions → Update QMD files, streamlining the process. - Uses targeted search and replace for updates. - Enhances context extraction using pypandoc AST parsing for richer paragraph context. - Refines the LLM prompt with more focused instructions and examples, improving caption quality. - Adds checks for Ollama and specified model, ensuring smooth execution. - Improves CLI with a more straightforward syntax and helper functions.	2025-07-22 17:06:21 -04:00
Vijay Janapa Reddi	106242b848	Fix caption formatting: title case bold, sentence case explanation FORMATTING IMPROVEMENTS: - Add format_bold_explanation_caption() function for proper capitalization - Bold part: Title Case (Every Important Word Capitalized) - Explanation part: sentence case (only first word capitalized, plus proper nouns) - Updated LLM prompt with explicit formatting rules - Added post-processing after LLM generation for consistent formatting TECHNICAL DETAILS: - Uses titlecase library for proper title case in bold section - Preserves proper nouns/acronyms: AI, ML, TikZ, LaTeX, GitHub, etc. - Applied automatically after LLM response validation - Robust regex parsing of bold: explanation format TESTED RESULTS: BEFORE: DATA SYNCHRONIZATION: The Adaptive Resource Pattern Addresses... AFTER: Data Synchronization in Distributed Systems: The adaptive resource pattern addresses... Perfect academic formatting for educational content	2025-07-22 15:55:44 -04:00
Vijay Janapa Reddi	42dc555791	Implement complete LLM integration for caption improvement NEW FEATURES: - Add --improve workflow to enhance captions with Ollama LLM - Implement context extraction around figures/tables from QMD files - Add multimodal image support for markdown figures - Support TikZ compilation to PNG for vision models - Enforce bold: explanation format through prompt engineering METHODS ADDED: - generate_caption_with_ollama(): Core LLM API integration with format validation - extract_section_context(): Smart context extraction around figure/table references - encode_image(): Base64 encoding for multimodal models - compile_tikz_to_image(): LaTeX to PNG pipeline for TikZ figures - improve_captions_with_llm(): Main orchestration for LLM improvement workflow - parse_sections(): QMD section parsing for context WORKFLOW: 1. --build-qmd-map: Extract figures/tables from QMD files 2. --improve: Use LLM to generate bold: explanation captions with context 3. --update: Apply improved captions back to QMD files TECHNICAL: - Default model: llava:7b (multimodal support) - Smart image path resolution for markdown figures - Temperature 0.3 for consistent formatting - Robust error handling and validation - Complete documentation and examples TESTED: Successfully improved 7 figures + 3 tables with proper format	2025-07-22 15:52:15 -04:00
Vijay Janapa Reddi	11f5d173ac	Replace custom title case with professional titlecase library - Install and import titlecase library for proper English title case - Remove complex 50+ line custom apply_title_case implementation - Remove bold: explanation formatting logic (no longer used) - Simplify normalize_caption_case to use titlecase library directly - Maintain proper capitalization without custom word lists - Significantly cleaner and more reliable implementation	2025-07-22 15:42:29 -04:00
Vijay Janapa Reddi	319814d90a	Simplify JSON structure to essential fields only - Remove unnecessary metadata fields (start_pos, end_pos, detection_method) - Remove tikz_code, language, and path from metadata - Keep only essential fields: current_caption, original_caption, new_caption, type, source_file - Add new_caption placeholder field for future improvements - Significantly reduce JSON file size and complexity	2025-07-22 15:39:56 -04:00
Vijay Janapa Reddi	b935980fc4	Implement robust file-based caption update mechanism - Add process_qmd_files function for efficient batch updates - Group figures and tables by source file to minimize I/O - Update each file once with all caption changes - Use regex-based replacement with proper error handling - Track source_file in JSON structure for organized processing - Support both figure and table caption updates in single pass	2025-07-22 15:36:25 -04:00
Vijay Janapa Reddi	c601364f07	Update table caption detection to support format transition - Support both old format (: Caption {#tbl-id}) and new format (Caption {#tbl-id}) - Improve regex patterns to capture only caption line, not table content - Add line boundary anchors to prevent capturing table structure - Update functions will convert old format to new format automatically - Ensure clean caption extraction without leading colons	2025-07-22 15:33:37 -04:00
Vijay Janapa Reddi	ea1f676ecd	Remove tex-file functionality and simplify workflow - Remove --tex-file argument and build_content_map_from_tex method - Eliminate all tex-file related processing code - Update help text and examples to focus on QMD-only approach - Remove phase-based terminology in favor of descriptive language - Simplify workflow to QMD-focused content mapping only - Maintain backward compatibility for existing save/load functions	2025-07-22 15:29:04 -04:00
Vijay Janapa Reddi	bedff42709	Add QMD-focused content map building with save-json option - Implement build_content_map_from_qmd for direct QMD file processing - Add specialized detection functions for markdown, tikz, and code figures - Support flexible fig-id and tbl-id placement in attributes - Add --save-json option to output content map for review - Achieve 100% extraction success rate across all core chapters - Process 270 figures and 92 tables with zero failures	2025-07-22 15:25:15 -04:00
Vijay Janapa Reddi	ca7bc57050	feat: Improve pattern flexibility for ID placement 🔧 ENHANCED DETECTION PATTERNS: - Allow fig-id/tbl-id anywhere in attribute blocks - Support: {#fig-id}, {width=80% #fig-id}, {#fig-id .class} - More robust handling of complex attribute combinations 📝 PATTERN IMPROVEMENTS: - Markdown figures: (?:\s\|[^}}])* allows ID placement anywhere - TikZ figures: Same flexible ID matching - Tables: Simplified to find any line with #tbl-id - Code figures: Already flexible, no changes needed ✅ VALIDATION CONFIRMED: - All existing detections maintained: 265/296 figures, 91/91 tables - No regressions in functionality - Patterns handle edge cases like multiple attributes 🎯 BENEFITS: - Handles real-world QMD variations where IDs aren't first - More resilient to attribute order changes - Simpler table detection logic - Future-proof for new attribute patterns Ready for QMD-focused development with robust pattern matching	2025-07-22 15:14:09 -04:00
Vijay Janapa Reddi	278275a3d5	feat: Switch default input to caps.tex for comprehensive coverage - Change default --tex-file from Machine-Learning-Systems.tex to caps.tex - Update build_content_map_from_tex default parameter - Update help documentation to reflect new default Benefits with caps.tex: - 296 figures (vs 37 previously) - 8x more content - 91 tables (vs 3 previously) - 30x more content - 95.9% figure mapping coverage (284/296 found) - 100% table mapping coverage (91/91 found) - Complete book content without requiring slow builds Usage remains the same: python improve_figure_captions.py --build-map # Now uses caps.tex python improve_figure_captions.py --build-map --tex-file custom.tex	2025-07-22 14:59:03 -04:00
Vijay Janapa Reddi	2d63744a74	feat: Add --tex-file option to specify custom LaTeX input - Add --tex-file argument to bypass automatic builds - Default remains Machine-Learning-Systems.tex for backward compatibility - Enables using existing .tex files without rebuilding - Update help documentation with usage examples Benefits: - Faster workflow when .tex file already exists - Support for custom/alternative .tex file paths - No need to rebuild entire book for caption processing Usage: python improve_figure_captions.py --build-map --tex-file custom.tex python improve_figure_captions.py --build-map # Uses default	2025-07-22 14:38:12 -04:00
Vijay Janapa Reddi	8fdfe2970b	feat: Add consistency checking for commented chapter content - Parse both active and commented chapters from _quarto.yml - Detect figures/tables from commented-out chapters in content map - Warn when .tex content doesn't match active book structure - Add detailed reporting of consistency issues - Provide actionable guidance for resolving mismatches Prevents silent failures where: - .tex file contains figures from all chapters (including commented) - QMD processing only scans active chapters - Caption updates would fail for commented chapter content Now shows: 'Found 9 active chapters, 50 commented chapters' and warns if content map contains items from inactive sources.	2025-07-22 14:17:37 -04:00
Vijay Janapa Reddi	550f2cac69	fix: Add TikZ figure detection and improve nested bracket handling - Add support for TikZ figures using ::: {#fig-id} div format - Detect figures in conditional visibility blocks - Fix regex pattern to handle nested square brackets in captions - Add type detection ('markdown' vs 'tikz') for proper updating - Update caption replacement logic for both formats Results: Perfect figure detection coverage - Before: 23/37 figures found (62%) - After: 37/37 figures found (100%) Fixes issue with fig-ai-timeline, fig-cloudml-example, and fig-TinyML-example that were previously showing as missing despite being present in QMD files.	2025-07-22 14:10:04 -04:00
Vijay Janapa Reddi	117fdeef99	feat: Add YAML-based book structure processing - Parse _quarto.yml to extract active chapters in order - Process QMD files following book structure instead of filesystem order - Skip commented-out chapters (47 total, only 8 active) - Add get_book_chapters_from_quarto() method - Add find_qmd_files_in_order() method - Update validation and quality check to use ordered processing Results in much more accurate analysis: - 23/37 figures found in active chapters (not 23/61 random files) - Missing 14 figures are in commented-out chapters - Follows intended book structure and order	2025-07-22 14:03:12 -04:00
Vijay Janapa Reddi	a99962e442	feat: Add comprehensive caption validation and repair system - Add CaptionQualityChecker class with quality rules - Implement --check/-c flag for caption quality analysis - Implement --repair/-r flag for selective caption fixing - Add short-form flags for all options (-b, -c, -r, -v, -u) - Support multiple directories with -d flag - Professional quality reports with issue categorization - Smart repair of punctuation and capitalization issues Quality rules detect missing punctuation, poor capitalization, generic captions, broken formatting, and LaTeX artifacts. Enables targeted caption improvements while maintaining quality.	2025-07-22 13:57:28 -04:00
Vijay Janapa Reddi	e7ab17ccca	Add AI-powered figure caption improvement script - Created scripts/improve_figure_captions.py - comprehensive tool for improving figure captions - Uses llava:7b model to analyze images and generate educational captions - Features JSON-structured responses for consistent formatting - Supports both single file (-f) and directory (-d) processing modes - Handles large images using requests library instead of curl - Robust regex patterns to find figure definitions with any attribute ordering - Comprehensive error handling and progress reporting with statistics - Enhanced prompting for textbook-specific educational content - Successfully tested on socratiq.qmd with 100% improvement rate	2025-07-22 11:45:01 -04:00
Vijay Janapa Reddi	7479bf52f7	Improves cross-reference generation with AI explanations Enhances the cross-reference generation script to leverage local LLMs (via Ollama) for generating natural language explanations, offering readers contextual insights into the connections between document sections. Refines prompts and adds retry logic for improved explanation quality. Also adds a command-line option to specify an Ollama model. Updates the cross-reference injection to display a better-formatted explanation. Fixes the reference to the cross-reference data file in the config.	2025-07-22 07:38:10 -04:00
Vijay Janapa Reddi	82f676da2f	Merge branch 'cross-referencing' into dev	2025-07-21 21:51:03 -04:00
Vijay Janapa Reddi	6d97ef2765	sweep study	2025-07-21 21:49:48 -04:00
Vijay Janapa Reddi	fe579b28d4	Add sophisticated explanation cleanup and optimize model recommendations based on systematic experiments	2025-07-21 20:54:37 -04:00
Vijay Janapa Reddi	f187028ef5	Fix _quarto.yml ordering to process all chapters in correct sequence	2025-07-21 20:48:55 -04:00
Vijay Janapa Reddi	7ddf5d0b1e	Optimize cross-reference generation: switch to llama3.1:8b with flexible 6-12 word explanations	2025-07-21 20:41:30 -04:00
Vijay Janapa Reddi	9551eb6fcd	Add comprehensive design space analysis and model comparison	2025-07-21 20:11:53 -04:00
Vijay Janapa Reddi	ad9f9c50ad	Optimize explanation length based on experiments	2025-07-21 19:55:30 -04:00
Vijay Janapa Reddi	42618a3205	Add LLM optimization experiment results	2025-07-21 19:54:51 -04:00
Vijay Janapa Reddi	0f9fb29a9b	🧪 LLM Optimization Experiment Framework - Ready to Test Multiple Models and Lengths	2025-07-21 19:46:20 -04:00
Vijay Janapa Reddi	8fb5e43f00	✅ Complete Cross-Reference System Ready for Production 🎉 FINAL SYSTEM FEATURES: - Bold directional arrows (→ ← •) with fallback logic - Academic § symbol with auto-generated section numbers - AI-generated natural explanations - Clean academic formatting - Robust error handling and edge cases - Professional textbook design standards 📚 READY FOR MODEL OPTIMIZATION EXPERIMENTS	2025-07-21 19:39:40 -04:00
Vijay Janapa Reddi	af2024a8fb	🚀 Revolutionary Feature: AI-Generated Cross-Reference Explanations ✨ WORLD-CLASS INNOVATION: - Added --explain flag for AI-generated cross-reference explanations - Uses local Ollama + qwen2.5:7b (private, no external API costs) - Generates 8-12 word explanations of WHY sections connect 🛠 TECHNICAL IMPLEMENTATION: - Interactive setup with model detection and installation guidance - Graceful fallback if Ollama not available - Self-contained in cross_refs.py with smart error handling - Adds 'explanation' field to JSON output 📚 EXAMPLE OUTPUT: - 'Understands AI's biological inspiration and neural network basics.' - 'Understands the context for choosing the right ML framework.' - 'Understands neural networks' role in AI and machine learning.' 🎯 RESULTS ACHIEVED: - 13 cross-references with AI explanations generated - 76% average similarity maintained - Student-focused language ('Understands...' tells value) - Professional textbook quality 🚀 USAGE: python3 cross_refs.py -g -m model -o output.json -d contents/ --explain This makes your textbook's cross-reference system truly revolutionary - no other textbook provides contextualized AI explanations for connections	2025-07-21 17:32:46 -04:00
Vijay Janapa Reddi	dd9d1b4ade	Update connection types: Foundation → Background - Changed 'Foundation' to 'Background' for backward references - Keeps 'Preview' for forward references - More intuitive terminology for LLM understanding: - Background = earlier chapters providing context/prerequisites - Preview = later chapters showing applications/extensions - Tested: 36 Background + 24 Preview connections generated successfully	2025-07-21 17:02:15 -04:00
Vijay Janapa Reddi	3d634f374b	Update command line options: -t for threshold, --threshold - Changed threshold option from --similarity-threshold to --threshold - Changed short form from -threshold to -t - Removed -t from --train (training now uses --train only) - Updated documentation examples to use --train instead of -t - More intuitive and concise command line usage	2025-07-21 16:43:09 -04:00
Vijay Janapa Reddi	f941fcaecd	Add short -threshold option for similarity threshold - Added -threshold as short form of --similarity-threshold - Maintains backward compatibility with existing scripts - Makes command line usage more concise	2025-07-21 16:39:14 -04:00
Vijay Janapa Reddi	212bdc76bf	Update cross-references and filters with improved results - Updated cross_refs.json with 96 sections and 74 cross-references - User-tweaked filters.yml (removed content_filters section) - Removed old cross_references.json file - Results: 50% extraction rate, 66.3% similarity	2025-07-21 16:25:36 -04:00
Vijay Janapa Reddi	ed4c96f0a5	Fix pypandoc content loss from ASCII tables - 30% improvement in section extraction	2025-07-21 16:25:12 -04:00
Vijay Janapa Reddi	acadb19572	🚀 Optimize filtering for better cross-reference coverage ✅ Improved Section Extraction: - Updated filters.yml to be less aggressive on substantial content - Removed exclusions for 'overview', 'introduction', 'conclusion' sections - These often contain valuable technical content, not just meta-content - Kept exclusions for truly meta content like 'purpose', 'learning objectives' ✅ Relaxed Content Filters: - Min length: 200 → 150 chars (allow shorter sections) - Max length: 15000 → 20000 chars (allow longer sections) - List ratio: 70% → 80% (allow more list-heavy content) - Code ratio: 80% → 90% (allow more code examples) - Citation ratio: 30% → 40% (allow more referenced content) ✅ Results with Domain-Adapted Model: - Section extraction: 52 → 74 sections (42% improvement) - Cross-references: 30 → 63 references (110% improvement) - File coverage: 8 → 13 files (62% more files connected) - Quality maintained: 65.6% average similarity ✅ Optimal Settings Identified: - Similarity threshold: 0.6 (vs default 0.65) - Max suggestions: 3 per section - Balances quantity and quality effectively This version provides much better coverage while maintaining high-quality cross-references between legitimate technical sections.	2025-07-21 16:08:27 -04:00
Vijay Janapa Reddi	494d6b7b58	🎉 Working version: Fix fake headers + Enhanced filtering system ✅ CRITICAL FIX - Preserve Original Section IDs: - Extract exact {#sec-...} identifiers from raw markdown - Preserve original section titles without modification - Eliminate artificial header reconstruction that created fake sections - No more invalid section IDs like 'sec-introduction-then' - Only process sections with legitimate {#sec-...} identifiers ✅ Enhanced File & Section Filtering: - Added file-level regex filtering to exclude entire files - Simplified section filtering to use only regex patterns - Removed redundant exact/pattern distinction - Support anchored patterns (^purpose$) and flexible patterns (.quiz.) - Updated filters.yml with comprehensive filtering rules ✅ Improved Content Processing: - Preserve original markdown structure during pypandoc cleaning - Match cleaned content with original headers by title similarity - Maintain authoritative section IDs throughout the pipeline - Remove only Quarto artifacts while keeping real headers intact ✅ User Experience Enhancements: - Added --quiet mode to reduce verbose output - Better error handling and validation for YAML configuration - Clear feedback about filtering and exclusions - Comprehensive testing verified all functionality ✅ Results: - Only legitimate cross-references between real sections - Exact section IDs matching original markdown files - High-quality embeddings from cleaned content - Robust filtering system for production use This version successfully addresses fake header generation and implements the complete filtering system as requested. All section IDs and titles are now preserved exactly as written in the original markdown files.	2025-07-21 16:00:25 -04:00
Vijay Janapa Reddi	3cb655c2a6	Improve YAML processing with proper Python library usage ✅ Enhanced YAML handling: - Added PyYAML import validation with clear error messages - Improved error handling for malformed YAML files - Added UTF-8 encoding support for international characters - Added config validation for required sections - Better user feedback for missing/invalid configuration ✅ Configuration updates: - Renamed to filters.yml for cleaner naming - Added comprehensive section filtering rules - Alphabetized exact matches for easier maintenance - Enhanced documentation with library requirements ✅ Testing verified: - YAML loading and validation - Section filtering (meta-content exclusion) - Error handling for malformed files - End-to-end cross-reference generation Addresses proper Python library usage for YAML processing.	2025-07-21 15:25:51 -04:00
Vijay Janapa Reddi	5540fc486e	Rename cross_referencing to cross_refs - Rename directory scripts/cross_referencing/ -> scripts/cross_refs/ - Rename cross_referencing.py -> cross_refs.py - Update JSON output structure to use file/sections/targets hierarchy - Update _quarto.yml path to look in current directory and exit if not found	2025-07-21 11:13:34 -04:00
Vijay Janapa Reddi	9fb7e14876	Removes and re-adds several image files. This commit removes several existing image files and re-adds them. This is likely due to some process that needed to be run over all images, or that they were unintentionally removed before. Updates image validation script to include autofix functionality to address format mismatches.	2025-07-11 21:53:02 -04:00
Vijay Janapa Reddi	b7395d942c	Adds script to extract headers from .qmd files Creates a script that extracts section headers from .qmd files, outputting them in a formatted table showing filename, header level, and header text. It supports processing either a single file or all .qmd files within a directory.	2025-07-11 18:09:22 -04:00
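
Commit 6be54008b3 above describes 3-retry logic with exponential backoff (2s, 4s, 8s) that retries only recoverable errors. The following is a minimal sketch of that idea, not the script's actual code: the names `RecoverableError` and `call_with_retry` are hypothetical, and it assumes the three delays fall between an initial attempt and up to three retries.

```python
import time


class RecoverableError(Exception):
    """Hypothetical marker for API/network failures that are worth retrying."""


def call_with_retry(fn, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on RecoverableError wait 2s, 4s, 8s and try again.

    Any other exception (for example a content problem) propagates
    immediately, so only recoverable failures trigger a retry.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except RecoverableError:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)  # 2s, then 4s, then 8s
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets the delays be recorded or skipped in tests without waiting in real time.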

326 Commits
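
Commits c601364f07 and dd49365d29 above describe accepting table captions both with and without a leading ':' and standardizing them to the ': Caption {#tbl-id}' format, with the ID allowed anywhere in the attribute block (ca7bc57050). The sketch below illustrates one way this could be done. The regular expression and the function name `standardize_table_caption` are assumptions made for illustration, not the patterns used in improve_figure_captions.py.

```python
import re

# Hypothetical pattern: an optional leading ':' followed by caption text that
# ends in an attribute block containing a #tbl-... identifier anywhere inside.
TABLE_CAPTION = re.compile(
    r"^\s*:?\s*(?P<text>.*?\{[^{}]*#tbl-[\w-]+[^{}]*\})\s*$"
)


def standardize_table_caption(line):
    """Return the line as ': Caption {#tbl-id}'; leave other lines unchanged."""
    match = TABLE_CAPTION.match(line)
    if match is None:
        return line  # not a table caption line
    return ": " + match.group("text")
```

For example, both `: Old caption {#tbl-a}` and `New caption {#tbl-b .striped}` come out with exactly one ': ' prefix, while a table row such as `| a | b |` is returned untouched.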