Commit Graph

321 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
b8183404b8 chore(release): shared versioning infrastructure
Lays foundation for unified release versioning across MLSysBook
publishable artifacts. Pure additions — no existing builds, configs,
or sources are touched.

scripts/version/release.py
  Python CLI with helpers:
  - compute-id: semver bump from previous tag (patch/minor/major/none/explicit)
  - compute-hash: deterministic SHA-256 over input directories with per-file index
  - emit-release: writes releases/<project>-<id>/release.json (canonical artifact)
  - emit-manifest: writes the build-time manifest that the deployable bundles
  Tier A (citable) emits per-file Merkle index; Tier B (lite) is flat.
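
The compute-hash helper described above could look roughly like the following. This is a minimal sketch, not the code in release.py: the function name `hash_tree`, the two-level layout (per-file digests folded into one overall digest), and the `path\0digest\n` record format are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def hash_tree(root: Path) -> tuple[str, dict[str, str]]:
    """Return (overall_hash, per_file_index) for every file under root.

    Hypothetical helper: the real compute-hash command may use a different
    record format or a deeper Merkle structure.
    """
    index: dict[str, str] = {}
    for path in root.rglob("*"):
        if path.is_file():
            # POSIX-style relative paths keep the result platform independent.
            rel = path.relative_to(root).as_posix()
            index[rel] = hashlib.sha256(path.read_bytes()).hexdigest()

    overall = hashlib.sha256()
    # Sorting the keys makes the digest independent of directory walk order.
    for rel in sorted(index):
        overall.update(f"{rel}\0{index[rel]}\n".encode("utf-8"))
    return overall.hexdigest(), index
```

Under these assumptions, a Tier A release would record both return values, while a Tier B release would keep only the overall digest.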

scripts/version/schema.json
  JSON Schema for release.json. Validates project/tier/release_id/release_hash
  + Tier A's files[] index. Used by validators in CI.

shared/release/release-pill.html
  Footer snippet — fetches deployable manifest at runtime, renders
  "v0.1.0 · Apr 26, 2026" pill. Configured per-project via
  <meta name="release-manifest"> tag. Silent on any fetch failure.

shared/release/release-card.html
  About-page snippet — fuller release-identity card with
  click-to-copy hash. Same fetch + meta-tag conventions.

shared/release/README.md
  Operator-facing contract documentation.

.github/workflows/_release-prepare.yml
  Reusable workflow_call. Validates confirm == "PUBLISH", computes
  new_release_id from previous tag + bump (delegates to release.py
  for canonical math). Outputs new_release_id/new_tag/previous_*
  for caller's downstream build and finalize steps. Refuses to
  re-tag existing releases (citation integrity).

Caller workflows still own their build commands and tag/release
creation; this only standardizes the input shape and version math.
2026-04-28 18:06:07 -04:00
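
The version math in commit b8183404b8 (previous tag plus a patch/minor/major/none/explicit bump) can be sketched as below. The function name `bump_version`, the optional `v` tag prefix, and the choice to return the ID without the prefix are assumptions for illustration; the actual compute-id command may behave differently.

```python
import re

# Accepts "0.1.0" or "v0.1.0"; pre-release and build suffixes are not handled.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def bump_version(previous: str, bump: str) -> str:
    """Return the next release ID given the previous tag and a bump kind.

    `bump` is one of "patch", "minor", "major", "none", or an explicit
    version string such as "2.3.4". Hypothetical helper, not release.py.
    """
    match = _SEMVER.match(previous)
    if match is None:
        raise ValueError(f"not a semantic version: {previous!r}")
    major, minor, patch = (int(part) for part in match.groups())

    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    if bump == "none":
        return f"{major}.{minor}.{patch}"

    explicit = _SEMVER.match(bump)
    if explicit is None:
        raise ValueError(f"unknown bump: {bump!r}")
    return ".".join(explicit.groups())
```

For example, `bump_version("v0.1.0", "minor")` returns `"0.2.0"`. A caller that must refuse to re-tag an existing release would compare the result against the list of existing tags before proceeding.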
Vijay Janapa Reddi
496e728135 fix(bib): restore vol1/vol2 references.bib after title-mangling regression
Commit 42bc54275 (figure-audit feat) inadvertently ran a tool that
broke BibTeX title syntax across hundreds of entries: e.g.
'{TensorFlow: Large-Scale...}' became '{{TensorFlow}}: {Large}-Scale...}',
producing unbalanced braces that caused the bib_lint parser to
truncate parsing partway through the entry. This surfaced in
pre-commit as 772 'missing required field' violations.

Restoring vol1+vol2 references.bib to the pre-mangling state
(9ebdf77d0) preserves all legitimate citation work from earlier
commits while undoing the unintended damage. The mechanical
formatter and bibtex-tidy hooks then re-emit a stable form.

Also: trailing newline added to scripts/README.md by pre-commit's
end-of-file-fixer.
2026-04-27 15:11:37 -04:00
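
The unbalanced-brace failure described in commit 496e728135 can be detected with a simple depth counter. This is a sketch of the idea only; the function name `braces_balanced` is hypothetical and this is not the bib_lint parser.

```python
def braces_balanced(value: str) -> bool:
    """Return True if every '{' in a BibTeX field value has a matching '}'."""
    depth = 0
    for ch in value:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                # A closing brace appeared before its opening brace.
                return False
    return depth == 0


original = "{TensorFlow: Large-Scale...}"
mangled = "{{TensorFlow}}: {Large}-Scale...}"
print(braces_balanced(original), braces_balanced(mangled))
```

The mangled title closes the field after `{{TensorFlow}}`, so the final `}` drives the depth negative, which is consistent with the parser truncating the entry partway through.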
Vijay Janapa Reddi
42bc54275d feat: add multimodal figure audit automation script and README
2026-04-27 13:35:48 -04:00
Vijay Janapa Reddi
086c2cbac8 refactor: move CI scripts to .github/, remove tools/
- Move sync_newsletter.py to .github/scripts/
- Move merge_contributors.py to .github/workflows/contributors/
- Update workflow YAML paths and script path references
- Delete reorganize_interviews_v2.py (one-off, already run)
- Remove tools/ (mcp_server, sysdesign_platform)
2026-03-21 09:04:53 -04:00
Vijay Janapa Reddi
0f034bb63a style(interviews): add horizontal rules between competency sections
Adds --- separators between competency topic sections within each
scope file for clearer visual separation when scrolling.
2026-03-21 08:26:43 -04:00
Vijay Janapa Reddi
05d97db43f feat(website): add About, Community pages and redesign Newsletter
New subsites:
- /about/ — Mission, Team, Contributors (auto-pulled from GitHub API),
  Adopters, Press, License sections with card-based design
- /community/ — Programs (TinyML4D, SciTinyML, Show & Tell, edX),
  Events calendar, Global Network map, Partners

Newsletter redesign:
- Buttondown API sync script (scripts/sync_newsletter.py) pulls
  published emails as .md files with auto-categorization and
  guest author detection
- Grid layout with banner images from Buttondown
- Embedded subscribe triggers shared subscribe-modal.js
- Dynamic stats (_stats.yml) updated by sync workflow
- Daily sync workflow at 6am UTC with build + deploy pipeline

Infrastructure:
- Navbar updated with anchor links (#mission, #team, #events, etc.)
- Subscribe button triggers shared modal across all subsites
- Contributors auto-update workflow generates about/contributors.json
- Deploy workflows for about/ and community/ subsites
- merge_contributors.py merges GitHub API with .all-contributorsrc
2026-03-20 14:44:49 -04:00
Vijay Janapa Reddi
0f33255b59 refactor(interviews): reorganize 1,063 questions by system scope
Restructure all 4 tracks from arbitrary round-based files to
learner-journey-based scopes. Each file represents the system
the student is reasoning about, with competency sub-sections
and L3→L6+ mastery levels inside.

Cloud: Single Machine → Distributed Systems → Serving Stack → Production Ops
Edge: Hardware Platform → Real-Time Pipeline → Deployed System
Mobile: Device & SoC → App Experience → Ship & Update
TinyML: Microcontroller → Sensing Pipeline → Deployed Device

Old round files preserved in _legacy/ folders. All cross-references
updated in README, STUDY_GUIDE, TOPIC_MAP, _quarto.yml, and index.qmd.
2026-03-20 10:40:55 -04:00
Vijay Janapa Reddi
7b92e11193 Repository Restructuring: Prepare for TinyTorch Integration (#1068)
* Restructure: Move book content to book/ subdirectory

- Move quarto/ → book/quarto/
- Move cli/ → book/cli/
- Move docker/ → book/docker/
- Move socratiQ/ → book/socratiQ/
- Move tools/ → book/tools/
- Move scripts/ → book/scripts/
- Move config/ → book/config/
- Move docs/ → book/docs/
- Move binder → book/binder

Git history fully preserved for all moved files.

Part of repository restructuring to support MLSysBook + TinyTorch.

Pre-commit hooks bypassed for this commit as paths need updating.

* Update pre-commit hooks for book/ subdirectory

- Update all quarto/ paths to book/quarto/
- Update all tools/ paths to book/tools/
- Update config/linting to book/config/linting
- Update project structure checks

Pre-commit hooks will now work with new directory structure.

* Update .gitignore for book/ subdirectory structure

- Update quarto/ paths to book/quarto/
- Update assets/ paths to book/quarto/assets/
- Maintain all existing ignore patterns

* Update GitHub workflows for book/ subdirectory

- Update all quarto/ paths to book/quarto/
- Update cli/ paths to book/cli/
- Update tools/ paths to book/tools/
- Update docker/ paths to book/docker/
- Update config/ paths to book/config/
- Maintain all workflow functionality

* Update CLI config to support book/ subdirectory

- Check for book/quarto/ path first
- Fall back to quarto/ for backward compatibility
- Maintain full CLI functionality

* Create new root and book READMEs for dual structure

- Add comprehensive root README explaining both projects
- Create book-specific README with quick start guide
- Document repository structure and navigation
- Prepare for TinyTorch integration
2025-12-05 14:04:21 -08:00
Vijay Janapa Reddi
4793ca1827 refactor: improve footnote validation script with cleanup capability
- Move script to tools/scripts/content/ to match project structure
- Add colored output with emoji indicators for better readability
- Add -f/--file and -d/--directory options for flexible input
- Add --clean flag to automatically remove unused footnote definitions
- Add --dry-run to preview cleanup without making changes
- Add --quiet mode for CI/CD pipelines
- Add --strict mode to fail on any issues
- Match style of other validation scripts in the project
- Update pre-commit hook to use new location and options

The script now provides clear visual feedback and can both validate
and fix footnote issues automatically when needed.
2025-09-06 16:07:53 -04:00
Vijay Janapa Reddi
e8c2bb461c feat: add footnote validation script and pre-commit hook
- Create scripts/validate_footnotes.py to check footnote consistency
- Validates all footnote references have definitions
- Validates all footnote definitions are actually used
- Detects duplicate footnote definitions
- Add to pre-commit hooks for automatic validation
- Currently reports 28 issues to be fixed in future PRs
2025-09-06 16:02:16 -04:00
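
The three checks listed in commit e8c2bb461c (undefined references, unused definitions, duplicate definitions) can be sketched as follows. This assumes Pandoc-style footnotes, with `[^id]` references and `[^id]: text` definitions at the start of a line; the function name `check_footnotes` and the returned dictionary shape are illustrative, not taken from validate_footnotes.py.

```python
import re
from collections import Counter

# A definition is "[^id]:" at the start of a line; a reference is any other "[^id]".
DEF_RE = re.compile(r"^\[\^([^\]\s]+)\]:", re.MULTILINE)
REF_RE = re.compile(r"\[\^([^\]\s]+)\]")


def check_footnotes(text: str) -> dict[str, list[str]]:
    """Report footnote inconsistencies in one document's text."""
    definitions = Counter(DEF_RE.findall(text))
    # Remove definition markers first so they are not counted as references.
    references = set(REF_RE.findall(DEF_RE.sub("", text)))
    defined = set(definitions)
    return {
        "missing": sorted(references - defined),    # referenced, never defined
        "unused": sorted(defined - references),     # defined, never referenced
        "duplicates": sorted(k for k, n in definitions.items() if n > 1),
    }
```

A pre-commit hook built on this would exit non-zero when any of the three lists is non-empty.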
Vijay Janapa Reddi
f55073d91e fix: properly remove all footnote inline references
- Fixed regex pattern in remove_footnotes.py to correctly match inline refs
- Added catalog_footnotes.py to track and analyze footnotes across the book
- Successfully removed all 366 inline references and definitions
- Provides context generation for footnote agent to avoid duplicates
2025-09-06 10:01:02 -04:00
Vijay Janapa Reddi
f410b7ed09 cleanup: remove all footnotes from qmd files for fresh start
- Created remove_footnotes.py script to cleanly remove all footnotes
- Removed 366 footnote definitions across 64 qmd files
- Preserves all main content while removing footnote references and definitions
- Prepares codebase for systematic footnote reintroduction
2025-09-06 09:57:35 -04:00
Vijay Janapa Reddi
c30cefca1a Allows configurable Quarto log level
Enables users to configure the Quarto log level via a workflow input. This provides more flexibility in controlling the verbosity of Quarto's output during the build process, allowing for easier debugging or reduced output when desired.

Removes the hardcoded DEBUG log level override in the render steps.
2025-07-31 11:49:59 -04:00
Vijay Janapa Reddi
1fdf3749e4 Fix pre-commit configuration paths and create scripts symlink
- Create symlink from scripts/ to tools/scripts/ for pre-commit hooks
- Update all script paths in .pre-commit-config.yaml to correct locations:
  - find_unreferenced_labels.py -> content/find_unreferenced_labels.py
  - section_id_manager.py -> content/manage_section_ids.py
  - collapse_blank_lines.py -> content/collapse_blank_lines.py
  - check_images.py -> utilities/check_images.py
- Update all file patterns from contents/ to book/contents/
- Fix trailing whitespace detected by hooks
2025-07-25 14:07:55 -04:00
Vijay Janapa Reddi
f032447639 Restructure repository for better organization and maintainability
- Move high-level assets to assets/ directory (covers, icons, styles, media)
- Consolidate build configuration in config/ directory (extensions, lua, tex)
- Group development tools under tools/ directory (scripts, dependencies, setup)
- Organize all book content under book/ directory
- Update all path references in _quarto.yml and other config files
- Preserve git history for all moved files
- Maintain full functionality for both HTML and PDF builds

This reorganization reduces root directory clutter from 50+ files to essential
project files only, providing clear separation of concerns and improved
maintainability for the textbook project.
2025-07-25 11:03:16 -04:00
Vijay Janapa Reddi
b8ba5f1df9 Merge branch 'fix-some-more-auto-captions' into dev
2025-07-24 14:10:08 -04:00
Vijay Janapa Reddi
a1c4af537c Remove ALL caption length limits - no truncation allowed
- Remove word count limits that rejected captions (was 150 words max)
- Set num_predict to -1 (unlimited tokens) for complete LLM responses
- Change rejection warnings to info messages
- Ensures generated captions are NEVER truncated regardless of length
- User requirement: no trimming allowed in real captions
2025-07-24 13:19:07 -04:00
Vijay Janapa Reddi
3cc94ce032 Fix caption truncation issues
- Increase LLM token limit from 120 to 200 tokens for complete responses
- Increase word count validation from 100 to 150 words maximum
- Increase display preview from 80 to 120 characters
- Addresses user reports of captions being cut off with '...'
- Allows for more complete and detailed educational captions
2025-07-24 13:16:07 -04:00
Vijay Janapa Reddi
468257e6ee Clean up output messaging for type-specific modes
- Remove 'Skipping X figures/tables/listings' messages when using type filters
- Update extraction summary to only show relevant types (figures-only, tables-only, listings-only)
- Update processing message to only mention types being processed
- Improves user experience by focusing output on relevant information
2025-07-24 13:04:19 -04:00
Vijay Janapa Reddi
7e0c997eb6 Adds script to find duplicate labels in Quarto files
Introduces a new script to identify duplicate labels (e.g., {#fig-xyz})
in Quarto (.qmd) files. This helps prevent ambiguous cross-reference
links and ensures proper linking within the documentation.

The script can be configured to check specific label types (figures,
tables, sections, listings, etc.) and provides different output
formats (text, JSON, summary) for various use cases, including
pre-commit integration. It also includes functionality to suggest fixes
for duplicate labels.

Also, renames figure labels to maintain consistency across the project.
2025-07-24 12:35:31 -04:00
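
The duplicate-label scan described in commit 7e0c997eb6 can be sketched as below. The function name `find_duplicate_labels`, the prefix list (`fig`, `tbl`, `sec`, `lst`), and the in-memory `{filename: text}` input are assumptions for illustration; the sketch only looks at `{#label}` attribute syntax and ignores labels declared through code-cell options.

```python
import re
from collections import defaultdict

# Matches the label inside attribute blocks such as {#fig-xyz width=50%}.
LABEL_RE = re.compile(r"\{#((?:fig|tbl|sec|lst)-[\w-]+)")


def find_duplicate_labels(files: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each label defined more than once to its (file, line) locations."""
    seen: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for label in LABEL_RE.findall(line):
                seen[label].append((name, lineno))
    return {label: locs for label, locs in seen.items() if len(locs) > 1}
```

In practice the `files` mapping would be built by reading each .qmd file under the content directory.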
Vijay Janapa Reddi
15a87811c0 Improves figure caption regex escaping
Adds a more selective regex escaping function for figure captions to avoid unintended escaping of common characters like parentheses.

This prevents issues where valid captions are not correctly identified due to over-escaping.
2025-07-24 08:06:20 -04:00
Vijay Janapa Reddi
1327b65543 Handles YAML-unsafe captions in R figures
Adds functions to ensure captions are properly quoted for YAML parsing, specifically addressing issues when captions start with "**" which can cause parsing errors.

This change ensures captions within R code blocks are correctly handled and updated, including adding quotes when necessary to avoid YAML parsing issues. It also provides a utility function for extracting clean captions from YAML values, handling both quoted and unquoted cases.
2025-07-24 07:04:59 -04:00
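
A caption that begins with `**` is unsafe as a plain YAML scalar because a leading `*` introduces an alias. The quoting and extraction helpers described in commit 1327b65543 might look like this sketch; the names `quote_caption` and `extract_caption` and the exact set of trigger characters are assumptions, and only the `\\` and `\"` escapes are handled.

```python
import re

# Leading characters that give a plain YAML scalar a special meaning.
_YAML_INDICATORS = set("*&!|>%@`'\"{}[],#-?:")


def quote_caption(caption: str) -> str:
    """Double-quote a caption when YAML would otherwise misparse it."""
    text = caption.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text  # treated as already quoted
    needs_quotes = text[:1] in _YAML_INDICATORS or ": " in text or " #" in text
    if not needs_quotes:
        return text
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def extract_caption(value: str) -> str:
    """Return the clean caption from a quoted or unquoted YAML value."""
    text = value.strip()
    if len(text) >= 2 and text[0] == text[-1] == '"':
        # Undo the two escapes produced by quote_caption.
        return re.sub(r"\\(.)", r"\1", text[1:-1])
    if len(text) >= 2 and text[0] == text[-1] == "'":
        return text[1:-1].replace("''", "'")
    return text
```

With these helpers, `quote_caption("**Title**: text")` returns `"**Title**: text"` including the surrounding double quotes, and `extract_caption` reverses it.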
Vijay Janapa Reddi
1d73d6d808 Corrects figure captions to adhere to style guide
Updates figure captions across multiple documents to ensure consistent formatting and adherence to the project's style guide. Specifically, this commit replaces instances of the asterisk-formatted source annotation with a period-formatted one. This change ensures a consistent and professional presentation of source attributions within the document.
2025-07-24 00:58:59 -04:00
Vijay Janapa Reddi
aa733ba24c Merge branch 'improve-captions' into dev
2025-07-24 00:45:07 -04:00
Vijay Janapa Reddi
5b5fab612e Enhances caption handling and adds listing support
Improves robustness of figure caption detection and repair, and introduces listing support.

- Adds support for detecting and improving captions for code listings.
- Enhances figure detection with more precise pattern matching.
- Uses LLM to repair captions missing the "**Bold**: explanation" format.
- Provides a summary of changes made during the repair process, including counts for basic and LLM-based fixes.
- Refines the extraction of context for better LLM caption generation.
2025-07-24 00:43:07 -04:00
Vijay Janapa Reddi
c1d38bbbd9 Make script lightweight and fix file targeting behavior
 Removed automatic model pulling:
   - No longer automatically installs/pulls Ollama models
   - Provides helpful instructions instead: 'ollama pull model-name'
   - Shows available models and helpful commands
   - Much lighter weight for users

 Fixed -f flag to process single files only:
   - Added specific_files parameter to build_content_map_from_qmd()
   - Added file validation (existence and .qmd extension)
   - Prevents fallback to full directory scanning
   - True single-file processing as intended

 Better error handling:
   - Clear file not found messages
   - Proper QMD file validation
   - Graceful model availability checking

🎯 Script is now lightweight, non-intrusive, and respects user intentions
2025-07-23 22:03:59 -04:00
Vijay Janapa Reddi
eb6a9f3b78 Optimize improve_figure_captions.py - remove dead code and redundancy
 Removed dead code and optimizations:
   - Duplicate extract_section_context function removed
   - Unused find_qmd_files(directories) function removed
   - Duplicate 'requests' import removed
   - Content map redundancy eliminated in analyze mode

 Performance improvements:
   - Added content_map parameter to check_caption_quality()
   - Eliminated redundant content map building in analysis workflow
   - Reduced script size from 3235 to 3221 lines

 Functionality verified:
   - All core features working (--analyze, --list-models, --help)
   - No breaking changes to public interface
   - Maintains complete backward compatibility

🎯 Cleaner, more efficient script with same functionality
2025-07-23 21:54:53 -04:00
Vijay Janapa Reddi
4fcf25b5d2 Add comprehensive Python source checker and documentation
 New Python-based source management:
   - check_sources.py: Advanced analysis and cleanup tool
   - SOURCE_MANAGEMENT_TOOLS.md: Complete usage documentation
   - source_analysis_report.json: Example analysis output

 Features:
   - Pattern analysis and validation
   - Automatic cleanup with safe regex patterns
   - Comprehensive reporting and statistics
   - Problem detection and resolution

🎯 Robust toolchain for maintaining citation quality
2025-07-23 21:44:10 -04:00
Vijay Janapa Reddi
dfd58009ec Fix indentation issues in improve_figure_captions.py
- Corrected Python indentation inconsistencies
- Fixed malformed code blocks from previous edits
- Maintains all functionality while cleaning up formatting
2025-07-23 21:39:33 -04:00
Vijay Janapa Reddi
53e2a6c01b WIP: Caption improvements and source standardization
- Standardized 50+ citations from *source: @citation* to Source: [@citation] format
- Fixed improve_figure_captions.py script functionality
- Added bold title generation and weak starter detection
- Created source standardization plan and automation script
- Enhanced caption quality validation and repair features

Note: Style validation skipped for WIP commit
2025-07-23 21:14:40 -04:00
Vijay Janapa Reddi
2274863b85 Refactors caption improvement logic
Simplifies caption improvement logic by removing unnecessary code related to table formatting and JSON serialization of Path objects. This leads to cleaner and more maintainable code.
2025-07-23 14:04:48 -04:00
Vijay Janapa Reddi
eec683053f fix: Prevent double colon prefix in table caption updates
PROBLEM: Getting ': : **Title**: ...' instead of ': **Title**: ...'

ROOT CAUSE: new_caption parameter sometimes already contains ': ' prefix
from validate_and_improve_caption(), but update_table_caption() was
unconditionally adding another ': ' prefix.

SOLUTION: Check if new_caption already starts with ': ' prefix
- If yes: Use as-is (no additional prefix)
- If no: Add ': ' prefix and ensure proper period formatting

BEFORE:
 Input: ': **AI Evolution**: text'  → Output: ': : **AI Evolution**: text'
 Double colon at start

AFTER:
 Input: ': **AI Evolution**: text'  → Output: ': **AI Evolution**: text'
 Input: '**AI Evolution**: text'   → Output: ': **AI Evolution**: text'
 Single colon prefix always

VERIFICATION:
 Starts correctly with single ': ': True
 No ': :' double prefix: True
 Matches correct format: True

RESULT: Clean table format ': **Title**: explanation {#tbl-id attributes}'
2025-07-23 13:49:19 -04:00
Vijay Janapa Reddi
a769ce2385 simplify: Streamline table caption updates to always use simple format
USER INSIGHT: Table format is actually very simple and consistent:
': Representative hardware platforms across... {#tbl-representative-systems hover striped}'

BEFORE: Complex build_table_search_patterns() with 4 different cases:
 Old format with line breaks
 Old format with content stuck to same line
 New format with line breaks
 New format with content stuck to same line
🔧 40+ lines of complex pattern matching logic

AFTER: Simple, single-pattern approach:
 One regex pattern: '^:?\s*{caption}(\s*\{{#tbl-id[^}]*\}})(.*)$'
 Always output: ': [new_caption]. {#tbl-id [attributes]}'
 Handle period correctly (avoid double periods)
 15 lines total - much cleaner

TECHNICAL CHANGES:
- Simplified build_table_search_patterns() from 40+ lines to 15 lines
- Single regex pattern handles both ': caption' and 'caption' formats
- Always produces consistent format: ': [caption]. {#tbl-id [attributes]}'
- Fixed period handling to avoid double periods in output

VERIFICATION:
 Input:  ': Representative hardware platforms... {#tbl-representative-systems hover striped}'
 Output: ': Hardware comparison across ML deployment... {#tbl-representative-systems hover striped}'
 Maintains simple format: ': [caption]. {#tbl-id [attributes]}'

USER WAS RIGHT: Keep it simple! No need for complex edge case handling.
2025-07-23 13:40:39 -04:00
Vijay Janapa Reddi
32ed90035a feat: Add selective content processing with --figures-only and --tables-only
NEW COMMAND LINE OPTIONS:
 --figures-only, -F: Process only figures (ignore tables)
 --tables-only, -T: Process only tables (ignore figures)
 Mutually exclusive group prevents conflicting options

COMPREHENSIVE IMPLEMENTATION:
🔧 Updated all processing methods:
- build_content_map_from_qmd(): Add filtering logic with skip messages
- check_caption_quality(): Filter content analysis
- repair_captions(): Filter repair operations
- complete_caption_improvement_workflow(): Filter LLM improvements

📊 FILTERING VERIFIED:
- Normal mode: 286 figures, 91 tables
- Figures-only: 286 figures, 0 tables 
- Tables-only: 0 figures, 91 tables 

💡 USAGE EXAMPLES:
python improve_figure_captions.py -d contents/core/ --figures-only
python improve_figure_captions.py -d contents/core/ -F
python improve_figure_captions.py --analyze -d contents/core/ --tables-only
python improve_figure_captions.py --repair -d contents/core/ -T

🎯 BENEFITS:
- Faster processing for targeted content types
- Useful for focused caption improvement workflows
- Helpful skip messages show what's being filtered
- Works with all modes (analyze, repair, improve, build-map)
2025-07-23 13:35:40 -04:00
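
The mutually exclusive `--figures-only` / `--tables-only` flags from commit 32ed90035a map directly onto argparse's mutually exclusive groups. The sketch below shows only the flag wiring; `build_parser` and `selected_types` are hypothetical names, and the real script defines many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Caption tool (illustrative)")
    parser.add_argument("-d", "--directory", help="directory of .qmd files")
    # argparse rejects the command line if both flags in the group are given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-F", "--figures-only", action="store_true",
                       help="process only figures")
    group.add_argument("-T", "--tables-only", action="store_true",
                       help="process only tables")
    return parser


def selected_types(args: argparse.Namespace) -> set[str]:
    """Translate the parsed flags into the set of content types to process."""
    if args.figures_only:
        return {"figure"}
    if args.tables_only:
        return {"table"}
    return {"figure", "table"}
```

Each processing method can then filter on the returned set instead of re-checking the two flags separately.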
Vijay Janapa Reddi
462cda70a4 fix: Correct table caption extraction to prevent double colon prefix
PROBLEM: User getting ': : **Hardware Spectrum**:' instead of ': **Hardware Spectrum**:'

ROOT CAUSE: Wrong regex pattern order in detect_table()

TECHNICAL CHANGE: Reordered regex patterns to try old format first
- Old format: ^:\s* properly strips ': ' prefix
- New format: ^[^{]+ only for captions without leading colon

RESULT: No more ': :' double colon prefixes in table captions
2025-07-23 13:06:30 -04:00
Vijay Janapa Reddi
9805cb178a enhance: Add explicit anti-weak-verb instructions to LLM prompt
PREVENTION > FIXING: Instead of just post-processing weak verbs, now explicitly instruct the LLM to avoid them

MULTI-LAYER PROTECTION:
1. 🚫 Critical rule section with 14 banned weak verbs listed explicitly
2.  Clear before/after examples showing bad vs good patterns
3. 🎯 Final reminder at end of prompt to reinforce the rule
4. 🛡️ Post-processing cleanup as backup safety net

INSTRUCTIONAL APPROACH:
- LLM now sees explicit 'NEVER start with Shows, Demonstrates, Illustrates...'
- Direct examples: 'Shows how X' → 'X processes Y through Z'
- Multiple reinforcement points throughout the prompt

RESULT: LLM should generate strong captions from the start, with hardcoded fixes as fallback
2025-07-23 12:42:50 -04:00
Vijay Janapa Reddi
ae6b66f87b fix: Eliminate weak verbs from LLM-generated captions
PROBLEM: LLM generating weak textbook captions like 'Shows how', 'Demonstrates how', 'Visualizes how'

ROOT CAUSE: Contradictory LLM prompt examples were teaching the exact weak language we wanted to avoid

SOLUTION:
1. Fixed LLM prompt examples to use strong, direct language
2. Added 6 new banned weak verbs: Visualizes, Exemplifies, Traces, Explains, Displays, Presents
3. Enhanced post-processing to catch and fix these patterns

RESULT: LLM now generates strong, direct textbook captions without weak descriptive language
2025-07-23 12:39:24 -04:00
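
The post-processing safety net mentioned in commits 9805cb178a and ae6b66f87b needs a way to detect captions that open with a banned verb. A minimal detection sketch follows; the list holds only the nine verbs named in these commit messages (the full list of 14 is not given here), and the function name `starts_with_weak_verb` is hypothetical.

```python
import re

# Only the verbs named in the commit messages; the script's full list is longer.
WEAK_STARTERS = (
    "Shows", "Demonstrates", "Illustrates", "Visualizes", "Exemplifies",
    "Traces", "Explains", "Displays", "Presents",
)

_WEAK_RE = re.compile(r"^(?:" + "|".join(WEAK_STARTERS) + r")\b", re.IGNORECASE)
# Optional table prefix and "**Bold**:" title that precede the explanation.
_TITLE_RE = re.compile(r"^(?::\s*)?\*\*[^*]+\*\*\s*:\s*")


def starts_with_weak_verb(caption: str) -> bool:
    """Return True if the explanation part of a caption opens with a weak verb."""
    explanation = _TITLE_RE.sub("", caption.strip())
    return bool(_WEAK_RE.match(explanation))
```

Detection is the easy half; rewriting the sentence into a direct construction is what the commits delegate to the LLM prompt and the replacement rules.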
Vijay Janapa Reddi
b5a97a83b9 fix: Handle all table caption edge cases with malformed colons
🐛 EDGE CASE FIXES: Robust colon handling for table captions

 PROBLEMS FOUND:
- ': :**bold**: explanation' → ': :**bold**: explanation' (double colon)
- '::**bold**: explanation' → ':**bold**: explanation' (wrong prefix)
- ':   :**bold**: explanation' → messy spacing issues

 COMPREHENSIVE SOLUTION:
1. **Detect existing table prefix** (': ' pattern)
2. **Strip table prefix** if present
3. **Clean ALL leading colons** with r'^:+\s*' regex
4. **Fix regex pattern** to only capture **bold** part
5. **Add single table prefix** for final output

🧪 EDGE CASES NOW HANDLED:
 ': :**AI Evolution**: text' → ': **AI Evolution**: text'
 '::**AI Evolution**: text' → ': **AI Evolution**: text'
 ':   :**AI Evolution**: text' → ': **AI Evolution**: text'
 '**AI Evolution**: text' → ': **AI Evolution**: text'
 ': **AI Evolution**: text' → ': **AI Evolution**: text'

🔧 TECHNICAL CHANGES:
- Added r'^:+\s*' pattern to remove multiple leading colons
- Updated regex from r'^(.*?\*\*[^*]+\*\*)\s*:\s*(.+)$'
  to r'^(\*\*[^*]+\*\*)\s*:\s*(.+)$' (exact **bold** match)
- Comprehensive cleanup prevents any colon prefix issues

 RESULT: Bulletproof table formatting regardless of input malformation
2025-07-23 12:32:35 -04:00
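
The edge cases listed in commit b5a97a83b9 can be reproduced with a single normalization step. This sketch uses the pattern `^(?:\s*:)+\s*`, which is a variant chosen here so that colons separated by spaces are removed in one pass; it is not the `^:+\s*` pattern quoted in the commit, and `normalize_table_caption` is a hypothetical name.

```python
import re

# One or more leading colons, each optionally preceded by whitespace.
_LEADING_COLONS = re.compile(r"^(?:\s*:)+\s*")


def normalize_table_caption(caption: str) -> str:
    """Strip any malformed colon prefix and add exactly one ': ' prefix."""
    body = _LEADING_COLONS.sub("", caption.strip())
    return f": {body}"


for raw in (
    ": :**AI Evolution**: text",
    "::**AI Evolution**: text",
    ":   :**AI Evolution**: text",
    "**AI Evolution**: text",
    ": **AI Evolution**: text",
):
    print(normalize_table_caption(raw))
```

All five inputs print `: **AI Evolution**: text`. Because the function is idempotent, running it on already-processed captions cannot reintroduce the `: :` prefix.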
Vijay Janapa Reddi
51ccddc37e fix: Prevent double colon in table captions
🐛 CRITICAL BUG FIX: Table prefix duplication

 PROBLEM:
- Table captions were getting double colons: ': : **Title**: explanation'
- Script blindly added ': ' prefix to ALL table captions
- But some captions already had ': ' prefix from previous processing

 SOLUTION:
- Check if caption already starts with ': ' before processing
- Strip existing ': ' prefix during processing
- Add back single ': ' prefix for tables only

🔧 LOGIC FLOW:
1. Input: ': **Bold**: explanation' (existing table format)
2. Strip prefix: '**Bold**: explanation' (for processing)
3. Process: improve language, spacing, etc.
4. Add table prefix: ': **Bold**: improved explanation'

🧪 TESTED:
- Table without prefix → gets ': ' added correctly
- Table with existing prefix → no duplication
- Table with messy spacing → cleaned and normalized
- All tests pass with proper ': **Bold**:' format

 RESULT: Clean table format with single colon prefix
2025-07-23 12:29:34 -04:00
Vijay Janapa Reddi
1ceb5779b6 docs: Update all documentation for streamlined command line options
📖 COMPREHENSIVE DOCUMENTATION UPDATE:

 Script Internal Documentation:
- Updated main script header docstring with new modes
- Updated class FigureCaptionImprover docstring
- Fixed function docstrings and comments throughout
- Removed references to old --workflow, --update, --validate options
- Updated print messages to reflect new terminology

📚 New External Documentation:
- Created scripts/FIGURE_CAPTIONS.md with complete usage guide
- Added model selection guide with speed/quality ratings
- Included troubleshooting section and best practices
- Updated scripts/README.md with script overview

🔧 Updated References:
- Main modes: --improve/-i, --build-map/-b, --analyze/-a, --repair/-r
- Removed outdated workflow terminology
- Clear examples for all usage patterns
- Performance optimization guidelines

📋 Documentation Features:
- Command-line option tables with short/long forms
- Model comparison with star ratings
- Before/after caption examples
- Integration with Quarto build process
- Success metrics and quality standards

 All documentation now reflects the streamlined v2.0 interface
2025-07-23 12:19:51 -04:00
Vijay Janapa Reddi
b788bb0104 feat: Add -b short option for --build-map for consistency
 CONSISTENCY FIX:
- Added -b short form for --build-map option
- All main modes now have both short and long forms:
  * --build-map/-b   (build content map)
  * --analyze/-a     (quality analysis)
  * --repair/-r      (fix formatting)
  * --improve/-i     (LLM improvement)

📝 UPDATED EXAMPLES:
- Added python script.py -b -d contents/core/ example
- Maintains consistency across all command options

🧪 TESTED:
- -b option works correctly with content map building
- Help text displays properly formatted options
2025-07-23 12:14:42 -04:00
Vijay Janapa Reddi
b1d1b1f3ca refactor: Streamline command line options to eliminate redundancy
🧹 MAJOR CLEANUP - Removed confusing redundant options:

 REMOVED REDUNDANT OPTIONS:
- --workflow (identical to default behavior)
- --update (useless without --improve, but mutually exclusive)
- --validate (confusing vs --check)
- --check (merged into --analyze)
- --build-qmd-map (renamed for clarity)

 NEW STREAMLINED OPTIONS:
- --improve/-i: LLM caption improvement (default mode)
- --build-map: Build and save content map to JSON
- --analyze/-a: Quality analysis + validation combined
- --repair/-r: Fix formatting issues only

🎯 BENEFITS:
- 4 clear options vs 7 confusing ones
- No more identical default vs --workflow confusion
- No more broken workflow separation (--improve + --update)
- Clear purpose for each option
- Intuitive short flags (-i, -a, -r)

📝 USAGE NOW CRYSTAL CLEAR:
- Default: python script.py -d contents/core/ (LLM improvement)
- Analysis: python script.py --analyze -d contents/core/
- Map only: python script.py --build-map -d contents/core/
- Repair: python script.py --repair -d contents/core/

 Backward compatibility maintained for core workflows
2025-07-23 12:06:02 -04:00
Vijay Janapa Reddi
984cd97997 fix: Critical table extraction regression - restore ability to find all tables
🐛 CRITICAL FIX: Table extraction was broken for most tables
- Before: 23/92 tables found (69 failures; 80.9% success counting all 362 figures and tables)
- After: 92/92 tables found (0 failures, 100% success)

🔧 Root cause: Regex pattern excluded ':' characters
- Tables like '**Special Function Units**: Details...' were rejected
- Pattern stopped at first ':' because it was in exclusion list [^{{\n:]+?
- Fix: Allow colons in caption text by changing to [^{{\n]+?

📊 Results across all core files:
- hw_acceleration: 0→21 tables (was completely broken)
- optimizations: 0→10 tables
- privacy_security: 0→8 tables
- frameworks: 0→6 tables
- All other files: Similar dramatic improvements

 Perfect extraction now working:
- 270 figures extracted successfully
- 92 tables extracted successfully
- 0 extraction failures
- Ready for LLM caption improvement processing
2025-07-23 11:42:53 -04:00
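
The root cause in commit 984cd97997 is easy to reproduce in isolation. The two patterns below are simplified stand-ins built around the character classes quoted in the commit (the doubled `{{` there is presumably f-string escaping); the surrounding prefix and label parts are assumptions, not the script's actual regex.

```python
import re

# Old: the caption class excludes ':', so captions containing a colon never match.
OLD = re.compile(r"^:?\s*([^{\n:]+?)\s*\{#(tbl-[\w-]+)[^}]*\}")
# New: colons are allowed inside the caption text.
NEW = re.compile(r"^:?\s*([^{\n]+?)\s*\{#(tbl-[\w-]+)[^}]*\}")

line = ": **Special Function Units**: Details {#tbl-sfu}"

print(OLD.match(line))            # None: the ':' after the bold title blocks the match
print(NEW.match(line).groups())   # ('**Special Function Units**: Details', 'tbl-sfu')
```

Since the caption style guide mandates a `**Bold**: explanation` form, nearly every well-formed table caption contains a colon, which explains why whole files dropped to zero extracted tables.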
Vijay Janapa Reddi
76826d2969 feat: Enhanced weak language removal with stronger replacements
🎯 Improved mid-sentence weak language detection:
- Handle 'X illustrates how Y' patterns in middle of sentences
- Replace with stronger constructions: 'Y through X', 'Y via X'
- Avoid circular replacements (no longer use 'shows' as replacement)

💪 Stronger language replacements:
- 'illustrates how' → direct restructure with stronger verbs
- 'demonstrates that' → 'establishes that' / 'confirms that'
- 'depicts' → 'presents' / 'exposes'
- 'reveals' → 'establishes' / 'exposes'

🧪 Comprehensive testing verified:
- All weak words removed from captions
- No circular replacement issues
- Maintains meaning while using stronger language
- Proper table format and spacing preserved

 Real-world test case from screenshot now produces clean output:
'Each of these scenarios illustrates how...'
→ 'Machine learning models can serve as amplifiers through each of these scenarios'
2025-07-23 11:28:43 -04:00
Vijay Janapa Reddi
da476f48a8 fix: Resolve spacing issues in caption processing
🔧 Added comprehensive spacing normalization:
- Replace multiple spaces with single space
- Remove leading/trailing whitespace
- Ensure single space after colons consistently

📝 Enhanced caption parsing:
- More robust regex for **bold**: format parsing
- Handle spaces around colons properly
- Normalize spacing throughout processing pipeline

 Fixed specific issues:
- No more double spaces in captions
- Consistent table format: ': **Bold**: explanation'
- Clean spacing even with malformed input
- Proper handling of edge cases (missing spaces, multiple spaces)

🧪 Thoroughly tested with edge cases including:
- Multiple consecutive spaces
- Missing spaces after colons
- Leading/trailing whitespace
- Complex mixed spacing scenarios
2025-07-23 11:25:06 -04:00
Vijay Janapa Reddi
cae8d061c9 feat: Comprehensive caption quality improvements
 Enhanced LLM prompt:
- Added explicit instructions to avoid weak sentence starters
- Discourage 'Illustrates', 'Shows', 'Demonstrates' etc.
- Encourage direct, strong language with examples

🔧 Post-processing improvements:
- Fix capitalization after periods (handle abbreviations)
- Replace weak sentence starters with direct language
- Ensure proper table format with ':' prefix
- Comprehensive caption validation pipeline

📝 Quality enforcement:
- Automatic detection and correction of weak language
- Proper sentence case throughout explanations
- Standardized table caption format: ': **Bold**: explanation'
- Word-by-word improvements while preserving meaning

 Fully tested with edge cases and validation
2025-07-23 11:21:35 -04:00
Vijay Janapa Reddi
dd49365d29 fix: Preserve and standardize colon prefix for table captions
- Preserve existing ':' prefix in old format table captions
- Add ':' prefix to new format table captions for consistency
- Standardize all table captions to ': Caption {#tbl-id}' format
- Tested with both old and new caption formats
2025-07-23 11:08:14 -04:00
Vijay Janapa Reddi
2ade2df23f fix: Ensure proper line breaks after table captions during updates
- Add line break preservation logic to table caption replacement
- Handle problematic case where content is stuck to caption line
- Force line break insertion between caption and following content
- Update TikZ figure caption replacement to preserve line breaks
- Tested with problematic cases to ensure proper formatting
2025-07-23 10:50:12 -04:00
Vijay Janapa Reddi
6be54008b3 feat: Add retry logic and improve sentence case formatting
- Implement 3-retry logic with exponential backoff (2s, 4s, 8s)
- Smart retry only for recoverable errors (API/network, not content issues)
- Enhanced sentence case formatting with comprehensive technical term preservation
- Preserve spaces and punctuation correctly during caption formatting
- Support for both fast models (qwen2.5:7b) and large models (gemma3:27b)
- Robust error handling for production caption improvement workflow
2025-07-22 21:06:54 -04:00
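
The retry policy in commit 6be54008b3 (three retries, delays of 2s, 4s, 8s, recoverable errors only) can be sketched as follows. The function name `call_with_retry`, the choice of `ConnectionError` and `TimeoutError` as the recoverable set, and the injectable `sleep` parameter are assumptions for illustration; the script's own classification of API and network errors is not shown in the log.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed stand-ins for "API/network" failures; content errors are not retried.
RECOVERABLE = (ConnectionError, TimeoutError)


def call_with_retry(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying recoverable errors with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return func()
        except RECOVERABLE:
            if attempt == retries:
                raise  # retries exhausted; surface the last error
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s with the defaults
    raise RuntimeError("retries must be >= 0")
```

Passing `sleep` as a parameter keeps the function testable without real delays. Any exception outside `RECOVERABLE`, such as a `ValueError` raised for an unusable caption, propagates immediately without a retry.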
Vijay Janapa Reddi
67348987b4 Remove llm_experiments directory
2025-07-22 18:46:33 -04:00