Commit Graph

290 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
eec683053f fix: Prevent double colon prefix in table caption updates
PROBLEM: Getting ': : **Title**: ...' instead of ': **Title**: ...'

ROOT CAUSE: new_caption parameter sometimes already contains ': ' prefix
from validate_and_improve_caption(), but update_table_caption() was
unconditionally adding another ': ' prefix.

SOLUTION: Check if new_caption already starts with ': ' prefix
- If yes: Use as-is (no additional prefix)
- If no: Add ': ' prefix and ensure proper period formatting

BEFORE:
 Input: ': **AI Evolution**: text'  → Output: ': : **AI Evolution**: text'
 Double colon at start

AFTER:
 Input: ': **AI Evolution**: text'  → Output: ': **AI Evolution**: text'
 Input: '**AI Evolution**: text'   → Output: ': **AI Evolution**: text'
 Single colon prefix always

VERIFICATION:
 Starts correctly with single ': ': True
 No ': :' double prefix: True
 Matches correct format: True

RESULT: Clean table format ': **Title**: explanation {#tbl-id attributes}'
2025-07-23 13:49:19 -04:00
Vijay Janapa Reddi
a769ce2385 simplify: Streamline table caption updates to always use simple format
USER INSIGHT: Table format is actually very simple and consistent:
': Representative hardware platforms across... {#tbl-representative-systems hover striped}'

BEFORE: Complex build_table_search_patterns() with 4 different cases:
 Old format with line breaks
 Old format with content stuck to same line
 New format with line breaks
 New format with content stuck to same line
🔧 40+ lines of complex pattern matching logic

AFTER: Simple, single-pattern approach:
 One regex pattern: '^:?\s*{caption}(\s*\{{#tbl-id[^}]*\}})(.*)$'
 Always output: ': [new_caption]. {#tbl-id [attributes]}'
 Handle period correctly (avoid double periods)
 15 lines total - much cleaner

TECHNICAL CHANGES:
- Simplified build_table_search_patterns() from 40+ lines to 15 lines
- Single regex pattern handles both ': caption' and 'caption' formats
- Always produces consistent format: ': [caption]. {#tbl-id [attributes]}'
- Fixed period handling to avoid double periods in output

VERIFICATION:
 Input:  ': Representative hardware platforms... {#tbl-representative-systems hover striped}'
 Output: ': Hardware comparison across ML deployment... {#tbl-representative-systems hover striped}'
 Maintains simple format: ': [caption]. {#tbl-id [attributes]}'

USER WAS RIGHT: Keep it simple! No need for complex edge case handling.
2025-07-23 13:40:39 -04:00
Vijay Janapa Reddi
32ed90035a feat: Add selective content processing with --figures-only and --tables-only
NEW COMMAND LINE OPTIONS:
 --figures-only, -F: Process only figures (ignore tables)
 --tables-only, -T: Process only tables (ignore figures)
 Mutually exclusive group prevents conflicting options

COMPREHENSIVE IMPLEMENTATION:
🔧 Updated all processing methods:
- build_content_map_from_qmd(): Add filtering logic with skip messages
- check_caption_quality(): Filter content analysis
- repair_captions(): Filter repair operations
- complete_caption_improvement_workflow(): Filter LLM improvements

📊 FILTERING VERIFIED:
- Normal mode: 286 figures, 91 tables
- Figures-only: 286 figures, 0 tables 
- Tables-only: 0 figures, 91 tables 

💡 USAGE EXAMPLES:
python improve_figure_captions.py -d contents/core/ --figures-only
python improve_figure_captions.py -d contents/core/ -F
python improve_figure_captions.py --analyze -d contents/core/ --tables-only
python improve_figure_captions.py --repair -d contents/core/ -T

🎯 BENEFITS:
- Faster processing for targeted content types
- Useful for focused caption improvement workflows
- Helpful skip messages show what's being filtered
- Works with all modes (analyze, repair, improve, build-map)
2025-07-23 13:35:40 -04:00
Vijay Janapa Reddi
462cda70a4 fix: Correct table caption extraction to prevent double colon prefix
PROBLEM: User getting ': : **Hardware Spectrum**:' instead of ': **Hardware Spectrum**:'

ROOT CAUSE: Wrong regex pattern order in detect_table()

TECHNICAL CHANGE: Reordered regex patterns to try old format first
- Old format: ^:\s* properly strips ': ' prefix
- New format: ^[^{]+ only for captions without leading colon

RESULT: No more ': :' double colon prefixes in table captions
2025-07-23 13:06:30 -04:00
Vijay Janapa Reddi
9805cb178a enhance: Add explicit anti-weak-verb instructions to LLM prompt
PREVENTION > FIXING: Instead of just post-processing weak verbs, now explicitly instruct the LLM to avoid them

MULTI-LAYER PROTECTION:
1. 🚫 Critical rule section with 14 banned weak verbs listed explicitly
2.  Clear before/after examples showing bad vs good patterns
3. 🎯 Final reminder at end of prompt to reinforce the rule
4. 🛡️ Post-processing cleanup as backup safety net

INSTRUCTIONAL APPROACH:
- LLM now sees explicit 'NEVER start with Shows, Demonstrates, Illustrates...'
- Direct examples: 'Shows how X' → 'X processes Y through Z'
- Multiple reinforcement points throughout the prompt

RESULT: LLM should generate strong captions from the start, with hardcoded fixes as fallback
2025-07-23 12:42:50 -04:00
Vijay Janapa Reddi
ae6b66f87b fix: Eliminate weak verbs from LLM-generated captions
PROBLEM: LLM generating weak textbook captions like 'Shows how', 'Demonstrates how', 'Visualizes how'

ROOT CAUSE: Contradictory LLM prompt examples were teaching the exact weak language we wanted to avoid

SOLUTION:
1. Fixed LLM prompt examples to use strong, direct language
2. Added 6 new banned weak verbs: Visualizes, Exemplifies, Traces, Explains, Displays, Presents
3. Enhanced post-processing to catch and fix these patterns

RESULT: LLM now generates strong, direct textbook captions without weak descriptive language
2025-07-23 12:39:24 -04:00
Vijay Janapa Reddi
b5a97a83b9 fix: Handle all table caption edge cases with malformed colons
🐛 EDGE CASE FIXES: Robust colon handling for table captions

 PROBLEMS FOUND:
- ': :**bold**: explanation' → ': :**bold**: explanation' (double colon)
- '::**bold**: explanation' → ':**bold**: explanation' (wrong prefix)
- ':   :**bold**: explanation' → messy spacing issues

 COMPREHENSIVE SOLUTION:
1. **Detect existing table prefix** (': ' pattern)
2. **Strip table prefix** if present
3. **Clean ALL leading colons** with r'^:+\s*' regex
4. **Fix regex pattern** to only capture **bold** part
5. **Add single table prefix** for final output

🧪 EDGE CASES NOW HANDLED:
 ': :**AI Evolution**: text' → ': **AI Evolution**: text'
 '::**AI Evolution**: text' → ': **AI Evolution**: text'
 ':   :**AI Evolution**: text' → ': **AI Evolution**: text'
 '**AI Evolution**: text' → ': **AI Evolution**: text'
 ': **AI Evolution**: text' → ': **AI Evolution**: text'

🔧 TECHNICAL CHANGES:
- Added r'^:+\s*' pattern to remove multiple leading colons
- Updated regex from r'^(.*?\*\*[^*]+\*\*)\s*:\s*(.+)$'
  to r'^(\*\*[^*]+\*\*)\s*:\s*(.+)$' (exact **bold** match)
- Comprehensive cleanup prevents any colon prefix issues

 RESULT: Bulletproof table formatting regardless of input malformation
2025-07-23 12:32:35 -04:00
Vijay Janapa Reddi
51ccddc37e fix: Prevent double colon in table captions
🐛 CRITICAL BUG FIX: Table prefix duplication

 PROBLEM:
- Table captions were getting double colons: ': : **Title**: explanation'
- Script blindly added ': ' prefix to ALL table captions
- But some captions already had ': ' prefix from previous processing

 SOLUTION:
- Check if caption already starts with ': ' before processing
- Strip existing ': ' prefix during processing
- Add back single ': ' prefix for tables only

🔧 LOGIC FLOW:
1. Input: ': **Bold**: explanation' (existing table format)
2. Strip prefix: '**Bold**: explanation' (for processing)
3. Process: improve language, spacing, etc.
4. Add table prefix: ': **Bold**: improved explanation'

🧪 TESTED:
- Table without prefix → gets ': ' added correctly
- Table with existing prefix → no duplication
- Table with messy spacing → cleaned and normalized
- All tests pass with proper ': **Bold**: format

 RESULT: Clean table format with single colon prefix
2025-07-23 12:29:34 -04:00
Vijay Janapa Reddi
1ceb5779b6 docs: Update all documentation for streamlined command line options
📖 COMPREHENSIVE DOCUMENTATION UPDATE:

 Script Internal Documentation:
- Updated main script header docstring with new modes
- Updated class FigureCaptionImprover docstring
- Fixed function docstrings and comments throughout
- Removed references to old --workflow, --update, --validate options
- Updated print messages to reflect new terminology

📚 New External Documentation:
- Created scripts/FIGURE_CAPTIONS.md with complete usage guide
- Added model selection guide with speed/quality ratings
- Included troubleshooting section and best practices
- Updated scripts/README.md with script overview

🔧 Updated References:
- Main modes: --improve/-i, --build-map/-b, --analyze/-a, --repair/-r
- Removed outdated workflow terminology
- Clear examples for all usage patterns
- Performance optimization guidelines

📋 Documentation Features:
- Command-line option tables with short/long forms
- Model comparison with star ratings
- Before/after caption examples
- Integration with Quarto build process
- Success metrics and quality standards

 All documentation now reflects the streamlined v2.0 interface
2025-07-23 12:19:51 -04:00
Vijay Janapa Reddi
b788bb0104 feat: Add -b short option for --build-map for consistency
 CONSISTENCY FIX:
- Added -b short form for --build-map option
- All main modes now have both short and long forms:
  * --build-map/-b   (build content map)
  * --analyze/-a     (quality analysis)
  * --repair/-r      (fix formatting)
  * --improve/-i     (LLM improvement)

📝 UPDATED EXAMPLES:
- Added python script.py -b -d contents/core/ example
- Maintains consistency across all command options

🧪 TESTED:
- -b option works correctly with content map building
- Help text displays properly formatted options
2025-07-23 12:14:42 -04:00
Vijay Janapa Reddi
b1d1b1f3ca refactor: Streamline command line options to eliminate redundancy
🧹 MAJOR CLEANUP - Removed confusing redundant options:

 REMOVED REDUNDANT OPTIONS:
- --workflow (identical to default behavior)
- --update (useless without --improve, but mutually exclusive)
- --validate (confusing vs --check)
- --check (merged into --analyze)
- --build-qmd-map (renamed for clarity)

 NEW STREAMLINED OPTIONS:
- --improve/-i: LLM caption improvement (default mode)
- --build-map: Build and save content map to JSON
- --analyze/-a: Quality analysis + validation combined
- --repair/-r: Fix formatting issues only

🎯 BENEFITS:
- 4 clear options vs 7 confusing ones
- No more identical default vs --workflow confusion
- No more broken workflow separation (--improve + --update)
- Clear purpose for each option
- Intuitive short flags (-i, -a, -r)

📝 USAGE NOW CRYSTAL CLEAR:
- Default: python script.py -d contents/core/ (LLM improvement)
- Analysis: python script.py --analyze -d contents/core/
- Map only: python script.py --build-map -d contents/core/
- Repair: python script.py --repair -d contents/core/

 Backward compatibility maintained for core workflows
2025-07-23 12:06:02 -04:00
Vijay Janapa Reddi
984cd97997 fix: Critical table extraction regression - restore ability to find all tables
🐛 CRITICAL FIX: Table extraction was broken for most tables
- Before: 23/92 tables found (69 failures, 80.9% success)
- After: 92/92 tables found (0 failures, 100% success)

🔧 Root cause: Regex pattern excluded ':' characters
- Tables like '**Special Function Units**: Details...' were rejected
- Pattern stopped at first ':' because it was in exclusion list [^{{\n:]+?
- Fix: Allow colons in caption text by changing to [^{{\n]+?

📊 Results across all core files:
- hw_acceleration: 0→21 tables (was completely broken)
- optimizations: 0→10 tables
- privacy_security: 0→8 tables
- frameworks: 0→6 tables
- All other files: Similar dramatic improvements

 Perfect extraction now working:
- 270 figures extracted successfully
- 92 tables extracted successfully
- 0 extraction failures
- Ready for LLM caption improvement processing
2025-07-23 11:42:53 -04:00
Vijay Janapa Reddi
76826d2969 feat: Enhanced weak language removal with stronger replacements
🎯 Improved mid-sentence weak language detection:
- Handle 'X illustrates how Y' patterns in middle of sentences
- Replace with stronger constructions: 'Y through X', 'Y via X'
- Avoid circular replacements (no longer use 'shows' as replacement)

💪 Stronger language replacements:
- 'illustrates how' → direct restructure with stronger verbs
- 'demonstrates that' → 'establishes that' / 'confirms that'
- 'depicts' → 'presents' / 'exposes'
- 'reveals' → 'establishes' / 'exposes'

🧪 Comprehensive testing verified:
- All weak words removed from captions
- No circular replacement issues
- Maintains meaning while using stronger language
- Proper table format and spacing preserved

 Real-world test case from screenshot now produces clean output:
'Each of these scenarios illustrates how...'
→ 'Machine learning models can serve as amplifiers through each of these scenarios'
2025-07-23 11:28:43 -04:00
Vijay Janapa Reddi
da476f48a8 fix: Resolve spacing issues in caption processing
🔧 Added comprehensive spacing normalization:
- Replace multiple spaces with single space
- Remove leading/trailing whitespace
- Ensure single space after colons consistently

📝 Enhanced caption parsing:
- More robust regex for **bold**: format parsing
- Handle spaces around colons properly
- Normalize spacing throughout processing pipeline

 Fixed specific issues:
- No more double spaces in captions
- Consistent table format: ': **Bold**: explanation'
- Clean spacing even with malformed input
- Proper handling of edge cases (missing spaces, multiple spaces)

🧪 Thoroughly tested with edge cases including:
- Multiple consecutive spaces
- Missing spaces after colons
- Leading/trailing whitespace
- Complex mixed spacing scenarios
2025-07-23 11:25:06 -04:00
Vijay Janapa Reddi
cae8d061c9 feat: Comprehensive caption quality improvements
 Enhanced LLM prompt:
- Added explicit instructions to avoid weak sentence starters
- Discourage 'Illustrates', 'Shows', 'Demonstrates' etc.
- Encourage direct, strong language with examples

🔧 Post-processing improvements:
- Fix capitalization after periods (handle abbreviations)
- Replace weak sentence starters with direct language
- Ensure proper table format with ':' prefix
- Comprehensive caption validation pipeline

📝 Quality enforcement:
- Automatic detection and correction of weak language
- Proper sentence case throughout explanations
- Standardized table caption format: ': **Bold**: explanation'
- Word-by-word improvements while preserving meaning

 Fully tested with edge cases and validation
2025-07-23 11:21:35 -04:00
Vijay Janapa Reddi
dd49365d29 fix: Preserve and standardize colon prefix for table captions
- Preserve existing ':' prefix in old format table captions
- Add ':' prefix to new format table captions for consistency
- Standardize all table captions to ': Caption {#tbl-id}' format
- Tested with both old and new caption formats
2025-07-23 11:08:14 -04:00
Vijay Janapa Reddi
2ade2df23f fix: Ensure proper line breaks after table captions during updates
- Add line break preservation logic to table caption replacement
- Handle problematic case where content is stuck to caption line
- Force line break insertion between caption and following content
- Update TikZ figure caption replacement to preserve line breaks
- Tested with problematic cases to ensure proper formatting
2025-07-23 10:50:12 -04:00
Vijay Janapa Reddi
6be54008b3 feat: Add retry logic and improve sentence case formatting
- Implement 3-retry logic with exponential backoff (2s, 4s, 8s)
- Smart retry only for recoverable errors (API/network, not content issues)
- Enhanced sentence case formatting with comprehensive technical term preservation
- Preserve spaces and punctuation correctly during caption formatting
- Support for both fast models (qwen2.5:7b) and large models (gemma3:27b)
- Robust error handling for production caption improvement workflow
2025-07-22 21:06:54 -04:00
Vijay Janapa Reddi
67348987b4 Remove llm_experiments directory 2025-07-22 18:46:33 -04:00
Vijay Janapa Reddi
d2b7ff04dd Improves figure and table caption formatting
Enhances figure and table caption formatting to ensure consistency and readability.

- Implements a comprehensive `apply_sentence_case` function that handles technical terms, acronyms, and proper nouns correctly.
- Refines the `format_bold_explanation_caption` function to use the improved sentence casing.
- Updates the caption update logic to support both figures and tables.
- Widens key phrase length for figures.
2025-07-22 18:21:29 -04:00
Vijay Janapa Reddi
c6f974c4c1 Improves figure caption generation with LLM
Enhances the figure caption improvement process by directly
updating the QMD files immediately after generating improved
captions with the LLM. This approach streamlines the workflow
and reduces the need for a separate update step. It also improves
the prompt given to the LLM to better guide caption generation.
2025-07-22 17:49:40 -04:00
Vijay Janapa Reddi
a6c6f091ae Improves figure captioning and extraction stats
Refines the guidelines for generating figure and table
captions to enhance their clarity and pedagogical value.

Also, enhances the reporting of extraction failures by tracking
specific IDs and files with issues for easier debugging.
2025-07-22 17:40:47 -04:00
Vijay Janapa Reddi
ae98521f59 Adds instruction to keep source in caption
Adds a guideline to the figure caption writing instructions to preserve source information when it exists.
2025-07-22 17:32:50 -04:00
Vijay Janapa Reddi
b6b055412b Enhances figure caption generation with Ollama
Improves the prompt used for generating figure captions with the Ollama model to focus on educational value and clarity.

The updated prompt provides clearer instructions and formatting guidelines for generating captions that teach students about the concepts illustrated in figures and tables. It emphasizes the use of key phrases and concise explanations tailored to the context of an AI/ML systems textbook.

Additionally, refactors the figure caption extraction logic to handle nested brackets in captions and escaped characters in paths. This fixes issues with figures containing links and special characters. Stores original captions as-is for better comparison.
2025-07-22 17:30:50 -04:00
Vijay Janapa Reddi
963519aff7 Enhances caption improvement workflow
Improves the caption improvement process by implementing a complete in-memory workflow, enhancing context extraction with pypandoc, and refining the LLM prompt for better caption generation. It also adds checks for Ollama and the model before proceeding and simplifies command-line arguments.

- Introduces a complete in-memory workflow: Build content map → Improve captions → Update QMD files, streamlining the process.
- Uses targeted search and replace for updates.
- Enhances context extraction using pypandoc AST parsing for richer paragraph context.
- Refines the LLM prompt with more focused instructions and examples, improving caption quality.
- Adds checks for Ollama and specified model, ensuring smooth execution.
- Improves CLI with a more straightforward syntax and helper functions.
2025-07-22 17:06:21 -04:00
Vijay Janapa Reddi
106242b848 Fix caption formatting: title case bold, sentence case explanation
FORMATTING IMPROVEMENTS:
- Add format_bold_explanation_caption() function for proper capitalization
- Bold part: Title Case (Every Important Word Capitalized)
- Explanation part: sentence case (only first word capitalized, plus proper nouns)
- Updated LLM prompt with explicit formatting rules
- Added post-processing after LLM generation for consistent formatting

TECHNICAL DETAILS:
- Uses titlecase library for proper title case in bold section
- Preserves proper nouns/acronyms: AI, ML, TikZ, LaTeX, GitHub, etc.
- Applied automatically after LLM response validation
- Robust regex parsing of **bold**: explanation format

TESTED RESULTS:
BEFORE: **DATA SYNCHRONIZATION**: The Adaptive Resource Pattern Addresses...
AFTER:  **Data Synchronization in Distributed Systems**: The adaptive resource pattern addresses...

Perfect academic formatting for educational content
2025-07-22 15:55:44 -04:00
Vijay Janapa Reddi
42dc555791 Implement complete LLM integration for caption improvement
NEW FEATURES:
- Add --improve workflow to enhance captions with Ollama LLM
- Implement context extraction around figures/tables from QMD files
- Add multimodal image support for markdown figures
- Support TikZ compilation to PNG for vision models
- Enforce **bold**: explanation format through prompt engineering

METHODS ADDED:
- generate_caption_with_ollama(): Core LLM API integration with format validation
- extract_section_context(): Smart context extraction around figure/table references
- encode_image(): Base64 encoding for multimodal models
- compile_tikz_to_image(): LaTeX to PNG pipeline for TikZ figures
- improve_captions_with_llm(): Main orchestration for LLM improvement workflow
- parse_sections(): QMD section parsing for context

WORKFLOW:
1. --build-qmd-map: Extract figures/tables from QMD files
2. --improve: Use LLM to generate **bold**: explanation captions with context
3. --update: Apply improved captions back to QMD files

TECHNICAL:
- Default model: llava:7b (multimodal support)
- Smart image path resolution for markdown figures
- Temperature 0.3 for consistent formatting
- Robust error handling and validation
- Complete documentation and examples

TESTED: Successfully improved 7 figures + 3 tables with proper format
2025-07-22 15:52:15 -04:00
Vijay Janapa Reddi
11f5d173ac Replace custom title case with professional titlecase library
- Install and import titlecase library for proper English title case
- Remove complex 50+ line custom apply_title_case implementation
- Remove **bold**: explanation formatting logic (no longer used)
- Simplify normalize_caption_case to use titlecase library directly
- Maintain proper capitalization without custom word lists
- Significantly cleaner and more reliable implementation
2025-07-22 15:42:29 -04:00
Vijay Janapa Reddi
319814d90a Simplify JSON structure to essential fields only
- Remove unnecessary metadata fields (start_pos, end_pos, detection_method)
- Remove tikz_code, language, and path from metadata
- Keep only essential fields: current_caption, original_caption, new_caption, type, source_file
- Add new_caption placeholder field for future improvements
- Significantly reduce JSON file size and complexity
2025-07-22 15:39:56 -04:00
Vijay Janapa Reddi
b935980fc4 Implement robust file-based caption update mechanism
- Add process_qmd_files function for efficient batch updates
- Group figures and tables by source file to minimize I/O
- Update each file once with all caption changes
- Use regex-based replacement with proper error handling
- Track source_file in JSON structure for organized processing
- Support both figure and table caption updates in single pass
2025-07-22 15:36:25 -04:00
Vijay Janapa Reddi
c601364f07 Update table caption detection to support format transition
- Support both old format (: Caption {#tbl-id}) and new format (Caption {#tbl-id})
- Improve regex patterns to capture only caption line, not table content
- Add line boundary anchors to prevent capturing table structure
- Update functions will convert old format to new format automatically
- Ensure clean caption extraction without leading colons
2025-07-22 15:33:37 -04:00
Vijay Janapa Reddi
ea1f676ecd Remove tex-file functionality and simplify workflow
- Remove --tex-file argument and build_content_map_from_tex method
- Eliminate all tex-file related processing code
- Update help text and examples to focus on QMD-only approach
- Remove phase-based terminology in favor of descriptive language
- Simplify workflow to QMD-focused content mapping only
- Maintain backward compatibility for existing save/load functions
2025-07-22 15:29:04 -04:00
Vijay Janapa Reddi
bedff42709 Add QMD-focused content map building with save-json option
- Implement build_content_map_from_qmd for direct QMD file processing
- Add specialized detection functions for markdown, tikz, and code figures
- Support flexible fig-id and tbl-id placement in attributes
- Add --save-json option to output content map for review
- Achieve 100% extraction success rate across all core chapters
- Process 270 figures and 92 tables with zero failures
2025-07-22 15:25:15 -04:00
Vijay Janapa Reddi
ca7bc57050 feat: Improve pattern flexibility for ID placement
🔧 ENHANCED DETECTION PATTERNS:
- Allow fig-id/tbl-id anywhere in attribute blocks
- Support: {#fig-id}, {width=80% #fig-id}, {#fig-id .class}
- More robust handling of complex attribute combinations

📝 PATTERN IMPROVEMENTS:
- Markdown figures: (?:\s|[^}}])* allows ID placement anywhere
- TikZ figures: Same flexible ID matching
- Tables: Simplified to find any line with #tbl-id
- Code figures: Already flexible, no changes needed

 VALIDATION CONFIRMED:
- All existing detections maintained: 265/296 figures, 91/91 tables
- No regressions in functionality
- Patterns handle edge cases like multiple attributes

🎯 BENEFITS:
- Handles real-world QMD variations where IDs aren't first
- More resilient to attribute order changes
- Simpler table detection logic
- Future-proof for new attribute patterns

Ready for QMD-focused development with robust pattern matching
2025-07-22 15:14:09 -04:00
Vijay Janapa Reddi
278275a3d5 feat: Switch default input to caps.tex for comprehensive coverage
- Change default --tex-file from Machine-Learning-Systems.tex to caps.tex
- Update build_content_map_from_tex default parameter
- Update help documentation to reflect new default

Benefits with caps.tex:
- 296 figures (vs 37 previously) - 8x more content
- 91 tables (vs 3 previously) - 30x more content
- 95.9% figure mapping coverage (284/296 found)
- 100% table mapping coverage (91/91 found)
- Complete book content without requiring slow builds

Usage remains the same:
  python improve_figure_captions.py --build-map  # Now uses caps.tex
  python improve_figure_captions.py --build-map --tex-file custom.tex
2025-07-22 14:59:03 -04:00
Vijay Janapa Reddi
2d63744a74 feat: Add --tex-file option to specify custom LaTeX input
- Add --tex-file argument to bypass automatic builds
- Default remains Machine-Learning-Systems.tex for backward compatibility
- Enables using existing .tex files without rebuilding
- Update help documentation with usage examples

Benefits:
- Faster workflow when .tex file already exists
- Support for custom/alternative .tex file paths
- No need to rebuild entire book for caption processing

Usage:
  python improve_figure_captions.py --build-map --tex-file custom.tex
  python improve_figure_captions.py --build-map  # Uses default
2025-07-22 14:38:12 -04:00
Vijay Janapa Reddi
8fdfe2970b feat: Add consistency checking for commented chapter content
- Parse both active and commented chapters from _quarto.yml
- Detect figures/tables from commented-out chapters in content map
- Warn when .tex content doesn't match active book structure
- Add detailed reporting of consistency issues
- Provide actionable guidance for resolving mismatches

Prevents silent failures where:
- .tex file contains figures from all chapters (including commented)
- QMD processing only scans active chapters
- Caption updates would fail for commented chapter content

Now shows: 'Found 9 active chapters, 50 commented chapters'
and warns if content map contains items from inactive sources.
2025-07-22 14:17:37 -04:00
Vijay Janapa Reddi
550f2cac69 fix: Add TikZ figure detection and improve nested bracket handling
- Add support for TikZ figures using ::: {#fig-id} div format
- Detect figures in conditional visibility blocks
- Fix regex pattern to handle nested square brackets in captions
- Add type detection ('markdown' vs 'tikz') for proper updating
- Update caption replacement logic for both formats

Results: Perfect figure detection coverage
- Before: 23/37 figures found (62%)
- After: 37/37 figures found (100%)

Fixes issue with fig-ai-timeline, fig-cloudml-example, and fig-TinyML-example
that were previously showing as missing despite being present in QMD files.
2025-07-22 14:10:04 -04:00
Vijay Janapa Reddi
117fdeef99 feat: Add YAML-based book structure processing
- Parse _quarto.yml to extract active chapters in order
- Process QMD files following book structure instead of filesystem order
- Skip commented-out chapters (47 total, only 8 active)
- Add get_book_chapters_from_quarto() method
- Add find_qmd_files_in_order() method
- Update validation and quality check to use ordered processing

Results in much more accurate analysis:
- 23/37 figures found in active chapters (not 23/61 random files)
- Missing 14 figures are in commented-out chapters
- Follows intended book structure and order
2025-07-22 14:03:12 -04:00
Vijay Janapa Reddi
a99962e442 feat: Add comprehensive caption validation and repair system
- Add CaptionQualityChecker class with quality rules
- Implement --check/-c flag for caption quality analysis
- Implement --repair/-r flag for selective caption fixing
- Add short-form flags for all options (-b, -c, -r, -v, -u)
- Support multiple directories with -d flag
- Professional quality reports with issue categorization
- Smart repair of punctuation and capitalization issues

Quality rules detect missing punctuation, poor capitalization,
generic captions, broken formatting, and LaTeX artifacts.
Enables targeted caption improvements while maintaining quality.
2025-07-22 13:57:28 -04:00
Vijay Janapa Reddi
e7ab17ccca Add AI-powered figure caption improvement script
- Created scripts/improve_figure_captions.py - comprehensive tool for improving figure captions
- Uses llava:7b model to analyze images and generate educational captions
- Features JSON-structured responses for consistent formatting
- Supports both single file (-f) and directory (-d) processing modes
- Handles large images using requests library instead of curl
- Robust regex patterns to find figure definitions with any attribute ordering
- Comprehensive error handling and progress reporting with statistics
- Enhanced prompting for textbook-specific educational content
- Successfully tested on socratiq.qmd with 100% improvement rate
2025-07-22 11:45:01 -04:00
Vijay Janapa Reddi
7479bf52f7 Improves cross-reference generation with AI explanations
Enhances the cross-reference generation script to leverage local LLMs
(via Ollama) for generating natural language explanations, offering readers
contextual insights into the connections between document sections.

Refines prompts and adds retry logic for improved explanation quality.
Also adds command-line option to specify a ollama model.

Updates the cross-reference injection to display a better formated explanation.

Fixes the reference to the cross-reference data file in the config.
2025-07-22 07:38:10 -04:00
Vijay Janapa Reddi
82f676da2f Merge branch 'cross-referencing' into dev 2025-07-21 21:51:03 -04:00
Vijay Janapa Reddi
6d97ef2765 sweep study 2025-07-21 21:49:48 -04:00
Vijay Janapa Reddi
fe579b28d4 Add sophisticated explanation cleanup and optimize model recommendations based on systematic experiments 2025-07-21 20:54:37 -04:00
Vijay Janapa Reddi
f187028ef5 Fix _quarto.yml ordering to process all chapters in correct sequence 2025-07-21 20:48:55 -04:00
Vijay Janapa Reddi
7ddf5d0b1e Optimize cross-reference generation: switch to llama3.1:8b with flexible 6-12 word explanations 2025-07-21 20:41:30 -04:00
Vijay Janapa Reddi
9551eb6fcd Add comprehensive design space analysis and model comparison 2025-07-21 20:11:53 -04:00
Vijay Janapa Reddi
ad9f9c50ad Optimize explanation length based on experiments 2025-07-21 19:55:30 -04:00
Vijay Janapa Reddi
42618a3205 Add LLM optimization experiment results 2025-07-21 19:54:51 -04:00