cs249r_book

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-06 09:38:33 -05:00

Author	SHA1	Message	Date
Vijay Janapa Reddi	b8183404b8	chore(release): shared versioning infrastructure Lays foundation for unified release versioning across MLSysBook publishable artifacts. Pure additions — no existing builds, configs, or sources are touched. scripts/version/release.py Python CLI with helpers: - compute-id: semver bump from previous tag (patch/minor/major/none/explicit) - compute-hash: deterministic SHA-256 over input directories with per-file index - emit-release: writes releases/<project>-<id>/release.json (canonical artifact) - emit-manifest: writes the build-time manifest the deployable bundles Tier A (citable) emits per-file Merkle index; Tier B (lite) is flat. scripts/version/schema.json JSON Schema for release.json. Validates project/tier/release_id/release_hash + Tier A's files[] index. Used by validators in CI. shared/release/release-pill.html Footer snippet — fetches deployable manifest at runtime, renders "v0.1.0 · Apr 26, 2026" pill. Configured per-project via <meta name="release-manifest"> tag. Silent on any fetch failure. shared/release/release-card.html About-page snippet — fuller release-identity card with click-to-copy hash. Same fetch + meta-tag conventions. shared/release/README.md Operator-facing contract documentation. .github/workflows/_release-prepare.yml Reusable workflow_call. Validates confirm == "PUBLISH", computes new_release_id from previous tag + bump (delegates to release.py for canonical math). Outputs new_release_id/new_tag/previous_* for caller's downstream build and finalize steps. Refuses to re-tag existing releases (citation integrity). Caller workflows still own their build commands and tag/release creation; this only standardizes the input shape and version math.	2026-04-28 18:06:07 -04:00
Vijay Janapa Reddi	496e728135	fix(bib): restore vol1/vol2 references.bib after title-mangling regression Commit `42bc54275` (figure-audit feat) inadvertently ran a tool that broke BibTeX title syntax across hundreds of entries: e.g. '{TensorFlow: Large-Scale...}' became '{{TensorFlow}}: {Large}-Scale...}', producing unbalanced braces that caused the bib_lint parser to truncate parsing partway through the entry. This surfaced in pre-commit as 772 'missing required field' violations. Restoring vol1+vol2 references.bib to the pre-mangling state (`9ebdf77d0`) preserves all legitimate citation work from earlier commits while undoing the unintended damage. The mechanical formatter and bibtex-tidy hooks then re-emit a stable form. Also: trailing newline added to scripts/README.md by pre-commit's end-of-file-fixer.	2026-04-27 15:11:37 -04:00
Vijay Janapa Reddi	42bc54275d	feat: add multimodal figure audit automation script and README	2026-04-27 13:35:48 -04:00
Vijay Janapa Reddi	086c2cbac8	refactor: move CI scripts to .github/, remove tools/ - Move sync_newsletter.py to .github/scripts/ - Move merge_contributors.py to .github/workflows/contributors/ - Update workflow YAML paths and script path references - Delete reorganize_interviews_v2.py (one-off, already run) - Remove tools/ (mcp_server, sysdesign_platform)	2026-03-21 09:04:53 -04:00
Vijay Janapa Reddi	0f034bb63a	style(interviews): add horizontal rules between competency sections Adds --- separators between competency topic sections within each scope file for clearer visual separation when scrolling.	2026-03-21 08:26:43 -04:00
Vijay Janapa Reddi	05d97db43f	feat(website): add About, Community pages and redesign Newsletter New subsites: - /about/ — Mission, Team, Contributors (auto-pulled from GitHub API), Adopters, Press, License sections with card-based design - /community/ — Programs (TinyML4D, SciTinyML, Show & Tell, edX), Events calendar, Global Network map, Partners Newsletter redesign: - Buttondown API sync script (scripts/sync_newsletter.py) pulls published emails as .md files with auto-categorization and guest author detection - Grid layout with banner images from Buttondown - Embedded subscribe triggers shared subscribe-modal.js - Dynamic stats (_stats.yml) updated by sync workflow - Daily sync workflow at 6am UTC with build + deploy pipeline Infrastructure: - Navbar updated with anchor links (#mission, #team, #events, etc.) - Subscribe button triggers shared modal across all subsites - Contributors auto-update workflow generates about/contributors.json - Deploy workflows for about/ and community/ subsites - merge_contributors.py merges GitHub API with .all-contributorsrc	2026-03-20 14:44:49 -04:00
Vijay Janapa Reddi	0f33255b59	refactor(interviews): reorganize 1,063 questions by system scope Restructure all 4 tracks from arbitrary round-based files to learner-journey-based scopes. Each file represents the system the student is reasoning about, with competency sub-sections and L3→L6+ mastery levels inside. Cloud: Single Machine → Distributed Systems → Serving Stack → Production Ops Edge: Hardware Platform → Real-Time Pipeline → Deployed System Mobile: Device & SoC → App Experience → Ship & Update TinyML: Microcontroller → Sensing Pipeline → Deployed Device Old round files preserved in _legacy/ folders. All cross-references updated in README, STUDY_GUIDE, TOPIC_MAP, _quarto.yml, and index.qmd.	2026-03-20 10:40:55 -04:00
Vijay Janapa Reddi	7b92e11193	Repository Restructuring: Prepare for TinyTorch Integration (#1068 ) * Restructure: Move book content to book/ subdirectory - Move quarto/ → book/quarto/ - Move cli/ → book/cli/ - Move docker/ → book/docker/ - Move socratiQ/ → book/socratiQ/ - Move tools/ → book/tools/ - Move scripts/ → book/scripts/ - Move config/ → book/config/ - Move docs/ → book/docs/ - Move binder → book/binder Git history fully preserved for all moved files. Part of repository restructuring to support MLSysBook + TinyTorch. Pre-commit hooks bypassed for this commit as paths need updating. * Update pre-commit hooks for book/ subdirectory - Update all quarto/ paths to book/quarto/ - Update all tools/ paths to book/tools/ - Update config/linting to book/config/linting - Update project structure checks Pre-commit hooks will now work with new directory structure. * Update .gitignore for book/ subdirectory structure - Update quarto/ paths to book/quarto/ - Update assets/ paths to book/quarto/assets/ - Maintain all existing ignore patterns * Update GitHub workflows for book/ subdirectory - Update all quarto/ paths to book/quarto/ - Update cli/ paths to book/cli/ - Update tools/ paths to book/tools/ - Update docker/ paths to book/docker/ - Update config/ paths to book/config/ - Maintain all workflow functionality * Update CLI config to support book/ subdirectory - Check for book/quarto/ path first - Fall back to quarto/ for backward compatibility - Maintain full CLI functionality * Create new root and book READMEs for dual structure - Add comprehensive root README explaining both projects - Create book-specific README with quick start guide - Document repository structure and navigation - Prepare for TinyTorch integration	2025-12-05 14:04:21 -08:00
Vijay Janapa Reddi	4793ca1827	refactor: improve footnote validation script with cleanup capability - Move script to tools/scripts/content/ to match project structure - Add colored output with emoji indicators for better readability - Add -f/--file and -d/--directory options for flexible input - Add --clean flag to automatically remove unused footnote definitions - Add --dry-run to preview cleanup without making changes - Add --quiet mode for CI/CD pipelines - Add --strict mode to fail on any issues - Match style of other validation scripts in the project - Update pre-commit hook to use new location and options The script now provides clear visual feedback and can both validate and fix footnote issues automatically when needed.	2025-09-06 16:07:53 -04:00
Vijay Janapa Reddi	e8c2bb461c	feat: add footnote validation script and pre-commit hook - Create scripts/validate_footnotes.py to check footnote consistency - Validates all footnote references have definitions - Validates all footnote definitions are actually used - Detects duplicate footnote definitions - Add to pre-commit hooks for automatic validation - Currently reports 28 issues to be fixed in future PRs	2025-09-06 16:02:16 -04:00
Vijay Janapa Reddi	f55073d91e	fix: properly remove all footnote inline references - Fixed regex pattern in remove_footnotes.py to correctly match inline refs - Added catalog_footnotes.py to track and analyze footnotes across the book - Successfully removed all 366 inline references and definitions - Provides context generation for footnote agent to avoid duplicates	2025-09-06 10:01:02 -04:00
Vijay Janapa Reddi	f410b7ed09	cleanup: remove all footnotes from qmd files for fresh start - Created remove_footnotes.py script to cleanly remove all footnotes - Removed 366 footnote definitions across 64 qmd files - Preserves all main content while removing footnote references and definitions - Prepares codebase for systematic footnote reintroduction	2025-09-06 09:57:35 -04:00
Vijay Janapa Reddi	c30cefca1a	Allows configurable Quarto log level Enables users to configure the Quarto log level via a workflow input. This provides more flexibility in controlling the verbosity of Quarto's output during the build process, allowing for easier debugging or reduced output when desired. Removes the hardcoded DEBUG log level override in the render steps.	2025-07-31 11:49:59 -04:00
Vijay Janapa Reddi	1fdf3749e4	Fix pre-commit configuration paths and create scripts symlink - Create symlink from scripts/ to tools/scripts/ for pre-commit hooks - Update all script paths in .pre-commit-config.yaml to correct locations: - find_unreferenced_labels.py -> content/find_unreferenced_labels.py - section_id_manager.py -> content/manage_section_ids.py - collapse_blank_lines.py -> content/collapse_blank_lines.py - check_images.py -> utilities/check_images.py - Update all file patterns from contents/ to book/contents/ - Fix trailing whitespace detected by hooks	2025-07-25 14:07:55 -04:00
Vijay Janapa Reddi	f032447639	Restructure repository for better organization and maintainability - Move high-level assets to assets/ directory (covers, icons, styles, media) - Consolidate build configuration in config/ directory (extensions, lua, tex) - Group development tools under tools/ directory (scripts, dependencies, setup) - Organize all book content under book/ directory - Update all path references in _quarto.yml and other config files - Preserve git history for all moved files - Maintain full functionality for both HTML and PDF builds This reorganization reduces root directory clutter from 50+ files to essential project files only, providing clear separation of concerns and improved maintainability for the textbook project.	2025-07-25 11:03:16 -04:00
Vijay Janapa Reddi	b8ba5f1df9	Merge branch 'fix-some-more-auto-captions' into dev	2025-07-24 14:10:08 -04:00
Vijay Janapa Reddi	a1c4af537c	Remove ALL caption length limits - no truncation allowed - Remove word count limits that rejected captions (was 150 words max) - Set num_predict to -1 (unlimited tokens) for complete LLM responses - Change rejection warnings to info messages - Ensures generated captions are NEVER truncated regardless of length - User requirement: no trimming allowed in real captions	2025-07-24 13:19:07 -04:00
Vijay Janapa Reddi	3cc94ce032	Fix caption truncation issues - Increase LLM token limit from 120 to 200 tokens for complete responses - Increase word count validation from 100 to 150 words maximum - Increase display preview from 80 to 120 characters - Addresses user reports of captions being cut off with '...' - Allows for more complete and detailed educational captions	2025-07-24 13:16:07 -04:00
Vijay Janapa Reddi	468257e6ee	Clean up output messaging for type-specific modes - Remove 'Skipping X figures/tables/listings' messages when using type filters - Update extraction summary to only show relevant types (figures-only, tables-only, listings-only) - Update processing message to only mention types being processed - Improves user experience by focusing output on relevant information	2025-07-24 13:04:19 -04:00
Vijay Janapa Reddi	7e0c997eb6	Adds script to find duplicate labels in Quarto files Introduces a new script to identify duplicate labels (e.g., {#fig-xyz}) in Quarto (.qmd) files. This helps prevent ambiguous cross-reference links and ensures proper linking within the documentation. The script can be configured to check specific label types (figures, tables, sections, listings, etc.) and provides different output formats (text, JSON, summary) for various use cases, including pre-commit integration. It also includes functionality to suggest fixes for duplicate labels. Also, renames figure labels to maintain consistency across the project.	2025-07-24 12:35:31 -04:00
Vijay Janapa Reddi	15a87811c0	Improves figure caption regex escaping Adds a more selective regex escaping function for figure captions to avoid unintended escaping of common characters like parentheses. This prevents issues where valid captions are not correctly identified due to over-escaping.	2025-07-24 08:06:20 -04:00
Vijay Janapa Reddi	1327b65543	Handles YAML-unsafe captions in R figures Adds functions to ensure captions are properly quoted for YAML parsing, specifically addressing issues when captions start with "**" which can cause parsing errors. This change ensures captions within R code blocks are correctly handled and updated, including adding quotes when necessary to avoid YAML parsing issues. It also provides a utility function for extracting clean captions from YAML values, handling both quoted and unquoted cases.	2025-07-24 07:04:59 -04:00
Vijay Janapa Reddi	1d73d6d808	Corrects figure captions to adhere to style guide Updates figure captions across multiple documents to ensure consistent formatting and adherence to the project's style guide. Specifically, this commit replaces instances of the asterisk-formatted source annotation with a period-formatted one. This change ensures a consistent and professional presentation of source attributions within the document.	2025-07-24 00:58:59 -04:00
Vijay Janapa Reddi	aa733ba24c	Merge branch 'improve-captions' into dev	2025-07-24 00:45:07 -04:00
Vijay Janapa Reddi	5b5fab612e	Enhances caption handling and adds listing support Improves robustness of figure caption detection and repair, and introduces listing support. - Adds support for detecting and improving captions for code listings. - Enhances figure detection with more precise pattern matching. - Uses LLM to repair captions missing the "Bold: explanation" format. - Provides a summary of changes made during the repair process, including counts for basic and LLM-based fixes. - Refines the extraction of context for better LLM caption generation.	2025-07-24 00:43:07 -04:00
Vijay Janapa Reddi	c1d38bbbd9	Make script lightweight and fix file targeting behavior ✅ Removed automatic model pulling: - No longer automatically installs/pulls Ollama models - Provides helpful instructions instead: 'ollama pull model-name' - Shows available models and helpful commands - Much lighter weight for users ✅ Fixed -f flag to process single files only: - Added specific_files parameter to build_content_map_from_qmd() - Added file validation (existence and .qmd extension) - Prevents fallback to full directory scanning - True single-file processing as intended ✅ Better error handling: - Clear file not found messages - Proper QMD file validation - Graceful model availability checking 🎯 Script is now lightweight, non-intrusive, and respects user intentions	2025-07-23 22:03:59 -04:00
Vijay Janapa Reddi	eb6a9f3b78	Optimize improve_figure_captions.py - remove dead code and redundancy ✅ Removed dead code and optimizations: - Duplicate extract_section_context function removed - Unused find_qmd_files(directories) function removed - Duplicate 'requests' import removed - Content map redundancy eliminated in analyze mode ✅ Performance improvements: - Added content_map parameter to check_caption_quality() - Eliminated redundant content map building in analysis workflow - Reduced script size from 3235 to 3221 lines ✅ Functionality verified: - All core features working (--analyze, --list-models, --help) - No breaking changes to public interface - Maintains complete backward compatibility 🎯 Cleaner, more efficient script with same functionality	2025-07-23 21:54:53 -04:00
Vijay Janapa Reddi	4fcf25b5d2	Add comprehensive Python source checker and documentation ✅ New Python-based source management: - check_sources.py: Advanced analysis and cleanup tool - SOURCE_MANAGEMENT_TOOLS.md: Complete usage documentation - source_analysis_report.json: Example analysis output ✅ Features: - Pattern analysis and validation - Automatic cleanup with safe regex patterns - Comprehensive reporting and statistics - Problem detection and resolution 🎯 Robust toolchain for maintaining citation quality	2025-07-23 21:44:10 -04:00
Vijay Janapa Reddi	dfd58009ec	Fix indentation issues in improve_figure_captions.py - Corrected Python indentation inconsistencies - Fixed malformed code blocks from previous edits - Maintains all functionality while cleaning up formatting	2025-07-23 21:39:33 -04:00
Vijay Janapa Reddi	53e2a6c01b	WIP: Caption improvements and source standardization - Standardized 50+ citations from source: @citation to Source: [@citation] format - Fixed improve_figure_captions.py script functionality - Added bold title generation and weak starter detection - Created source standardization plan and automation script - Enhanced caption quality validation and repair features Note: Style validation skipped for WIP commit	2025-07-23 21:14:40 -04:00
Vijay Janapa Reddi	2274863b85	Refactors caption improvement logic Simplifies caption improvement logic by removing unnecessary code related to table formatting and JSON serialization of Path objects. This leads to cleaner and more maintainable code.	2025-07-23 14:04:48 -04:00
Vijay Janapa Reddi	eec683053f	fix: Prevent double colon prefix in table caption updates PROBLEM: Getting ': : Title: ...' instead of ': Title: ...' ROOT CAUSE: new_caption parameter sometimes already contains ': ' prefix from validate_and_improve_caption(), but update_table_caption() was unconditionally adding another ': ' prefix. SOLUTION: Check if new_caption already starts with ': ' prefix - If yes: Use as-is (no additional prefix) - If no: Add ': ' prefix and ensure proper period formatting BEFORE: ❌ Input: ': AI Evolution: text' → Output: ': : AI Evolution: text' ❌ Double colon at start AFTER: ✅ Input: ': AI Evolution: text' → Output: ': AI Evolution: text' ✅ Input: 'AI Evolution: text' → Output: ': AI Evolution: text' ✅ Single colon prefix always VERIFICATION: ✅ Starts correctly with single ': ': True ✅ No ': :' double prefix: True ✅ Matches correct format: True RESULT: Clean table format ': Title: explanation {#tbl-id attributes}'	2025-07-23 13:49:19 -04:00
Vijay Janapa Reddi	a769ce2385	simplify: Streamline table caption updates to always use simple format USER INSIGHT: Table format is actually very simple and consistent: ': Representative hardware platforms across... {#tbl-representative-systems hover striped}' BEFORE: Complex build_table_search_patterns() with 4 different cases: ❌ Old format with line breaks ❌ Old format with content stuck to same line ❌ New format with line breaks ❌ New format with content stuck to same line 🔧 40+ lines of complex pattern matching logic AFTER: Simple, single-pattern approach: ✅ One regex pattern: '^:?\s{caption}(\s\{{#tbl-id[^}]\}})(.)$' ✅ Always output: ': [new_caption]. {#tbl-id [attributes]}' ✅ Handle period correctly (avoid double periods) ✅ 15 lines total - much cleaner TECHNICAL CHANGES: - Simplified build_table_search_patterns() from 40+ lines to 15 lines - Single regex pattern handles both ': caption' and 'caption' formats - Always produces consistent format: ': [caption]. {#tbl-id [attributes]}' - Fixed period handling to avoid double periods in output VERIFICATION: ✅ Input: ': Representative hardware platforms... {#tbl-representative-systems hover striped}' ✅ Output: ': Hardware comparison across ML deployment... {#tbl-representative-systems hover striped}' ✅ Maintains simple format: ': [caption]. {#tbl-id [attributes]}' USER WAS RIGHT: Keep it simple! No need for complex edge case handling.	2025-07-23 13:40:39 -04:00
Vijay Janapa Reddi	32ed90035a	feat: Add selective content processing with --figures-only and --tables-only NEW COMMAND LINE OPTIONS: ✅ --figures-only, -F: Process only figures (ignore tables) ✅ --tables-only, -T: Process only tables (ignore figures) ✅ Mutually exclusive group prevents conflicting options COMPREHENSIVE IMPLEMENTATION: 🔧 Updated all processing methods: - build_content_map_from_qmd(): Add filtering logic with skip messages - check_caption_quality(): Filter content analysis - repair_captions(): Filter repair operations - complete_caption_improvement_workflow(): Filter LLM improvements 📊 FILTERING VERIFIED: - Normal mode: 286 figures, 91 tables - Figures-only: 286 figures, 0 tables ✅ - Tables-only: 0 figures, 91 tables ✅ 💡 USAGE EXAMPLES: python improve_figure_captions.py -d contents/core/ --figures-only python improve_figure_captions.py -d contents/core/ -F python improve_figure_captions.py --analyze -d contents/core/ --tables-only python improve_figure_captions.py --repair -d contents/core/ -T 🎯 BENEFITS: - Faster processing for targeted content types - Useful for focused caption improvement workflows - Helpful skip messages show what's being filtered - Works with all modes (analyze, repair, improve, build-map)	2025-07-23 13:35:40 -04:00
Vijay Janapa Reddi	462cda70a4	fix: Correct table caption extraction to prevent double colon prefix PROBLEM: User getting ': : Hardware Spectrum:' instead of ': Hardware Spectrum:' ROOT CAUSE: Wrong regex pattern order in detect_table() TECHNICAL CHANGE: Reordered regex patterns to try old format first - Old format: ^:\s* properly strips ': ' prefix - New format: ^[^{]+ only for captions without leading colon RESULT: No more ': :' double colon prefixes in table captions	2025-07-23 13:06:30 -04:00
Vijay Janapa Reddi	9805cb178a	enhance: Add explicit anti-weak-verb instructions to LLM prompt PREVENTION > FIXING: Instead of just post-processing weak verbs, now explicitly instruct the LLM to avoid them MULTI-LAYER PROTECTION: 1. 🚫 Critical rule section with 14 banned weak verbs listed explicitly 2. ❌✅ Clear before/after examples showing bad vs good patterns 3. 🎯 Final reminder at end of prompt to reinforce the rule 4. 🛡️ Post-processing cleanup as backup safety net INSTRUCTIONAL APPROACH: - LLM now sees explicit 'NEVER start with Shows, Demonstrates, Illustrates...' - Direct examples: 'Shows how X' → 'X processes Y through Z' - Multiple reinforcement points throughout the prompt RESULT: LLM should generate strong captions from the start, with hardcoded fixes as fallback	2025-07-23 12:42:50 -04:00
Vijay Janapa Reddi	ae6b66f87b	fix: Eliminate weak verbs from LLM-generated captions PROBLEM: LLM generating weak textbook captions like 'Shows how', 'Demonstrates how', 'Visualizes how' ROOT CAUSE: Contradictory LLM prompt examples were teaching the exact weak language we wanted to avoid SOLUTION: 1. Fixed LLM prompt examples to use strong, direct language 2. Added 6 new banned weak verbs: Visualizes, Exemplifies, Traces, Explains, Displays, Presents 3. Enhanced post-processing to catch and fix these patterns RESULT: LLM now generates strong, direct textbook captions without weak descriptive language	2025-07-23 12:39:24 -04:00
Vijay Janapa Reddi	b5a97a83b9	fix: Handle all table caption edge cases with malformed colons 🐛 EDGE CASE FIXES: Robust colon handling for table captions ❌ PROBLEMS FOUND: - ': :bold: explanation' → ': :bold: explanation' (double colon) - '::bold: explanation' → ':bold: explanation' (wrong prefix) - ': :bold: explanation' → messy spacing issues ✅ COMPREHENSIVE SOLUTION: 1. Detect existing table prefix (': ' pattern) 2. Strip table prefix if present 3. Clean ALL leading colons with r'^:+\s' regex 4. Fix regex pattern* to only capture bold part 5. Add single table prefix for final output 🧪 EDGE CASES NOW HANDLED: ✅ ': :AI Evolution: text' → ': AI Evolution: text' ✅ '::AI Evolution: text' → ': AI Evolution: text' ✅ ': :AI Evolution: text' → ': AI Evolution: text' ✅ 'AI Evolution: text' → ': AI Evolution: text' ✅ ': AI Evolution: text' → ': AI Evolution: text' 🔧 TECHNICAL CHANGES: - Added r'^:+\s' pattern to remove multiple leading colons - Updated regex from r'^(.?\\[^]+\\)\s:\s(.+)$' to r'^(\\[^]+\\)\s:\s(.+)$' (exact bold match) - Comprehensive cleanup prevents any colon prefix issues ✅ RESULT: Bulletproof table formatting regardless of input malformation	2025-07-23 12:32:35 -04:00
Vijay Janapa Reddi	51ccddc37e	fix: Prevent double colon in table captions 🐛 CRITICAL BUG FIX: Table prefix duplication ❌ PROBLEM: - Table captions were getting double colons: ': : Title: explanation' - Script blindly added ': ' prefix to ALL table captions - But some captions already had ': ' prefix from previous processing ✅ SOLUTION: - Check if caption already starts with ': ' before processing - Strip existing ': ' prefix during processing - Add back single ': ' prefix for tables only 🔧 LOGIC FLOW: 1. Input: ': Bold: explanation' (existing table format) 2. Strip prefix: 'Bold: explanation' (for processing) 3. Process: improve language, spacing, etc. 4. Add table prefix: ': Bold: improved explanation' 🧪 TESTED: - Table without prefix → gets ': ' added correctly - Table with existing prefix → no duplication - Table with messy spacing → cleaned and normalized - All tests pass with proper ': Bold: format ✅ RESULT: Clean table format with single colon prefix	2025-07-23 12:29:34 -04:00
Vijay Janapa Reddi	1ceb5779b6	docs: Update all documentation for streamlined command line options 📖 COMPREHENSIVE DOCUMENTATION UPDATE: ✅ Script Internal Documentation: - Updated main script header docstring with new modes - Updated class FigureCaptionImprover docstring - Fixed function docstrings and comments throughout - Removed references to old --workflow, --update, --validate options - Updated print messages to reflect new terminology 📚 New External Documentation: - Created scripts/FIGURE_CAPTIONS.md with complete usage guide - Added model selection guide with speed/quality ratings - Included troubleshooting section and best practices - Updated scripts/README.md with script overview 🔧 Updated References: - Main modes: --improve/-i, --build-map/-b, --analyze/-a, --repair/-r - Removed outdated workflow terminology - Clear examples for all usage patterns - Performance optimization guidelines 📋 Documentation Features: - Command-line option tables with short/long forms - Model comparison with star ratings - Before/after caption examples - Integration with Quarto build process - Success metrics and quality standards ✅ All documentation now reflects the streamlined v2.0 interface	2025-07-23 12:19:51 -04:00
Vijay Janapa Reddi	b788bb0104	feat: Add -b short option for --build-map for consistency ✅ CONSISTENCY FIX: - Added -b short form for --build-map option - All main modes now have both short and long forms: * --build-map/-b (build content map) * --analyze/-a (quality analysis) * --repair/-r (fix formatting) * --improve/-i (LLM improvement) 📝 UPDATED EXAMPLES: - Added python script.py -b -d contents/core/ example - Maintains consistency across all command options 🧪 TESTED: - -b option works correctly with content map building - Help text displays properly formatted options	2025-07-23 12:14:42 -04:00
Vijay Janapa Reddi	b1d1b1f3ca	refactor: Streamline command line options to eliminate redundancy 🧹 MAJOR CLEANUP - Removed confusing redundant options: ❌ REMOVED REDUNDANT OPTIONS: - --workflow (identical to default behavior) - --update (useless without --improve, but mutually exclusive) - --validate (confusing vs --check) - --check (merged into --analyze) - --build-qmd-map (renamed for clarity) ✅ NEW STREAMLINED OPTIONS: - --improve/-i: LLM caption improvement (default mode) - --build-map: Build and save content map to JSON - --analyze/-a: Quality analysis + validation combined - --repair/-r: Fix formatting issues only 🎯 BENEFITS: - 4 clear options vs 7 confusing ones - No more identical default vs --workflow confusion - No more broken workflow separation (--improve + --update) - Clear purpose for each option - Intuitive short flags (-i, -a, -r) 📝 USAGE NOW CRYSTAL CLEAR: - Default: python script.py -d contents/core/ (LLM improvement) - Analysis: python script.py --analyze -d contents/core/ - Map only: python script.py --build-map -d contents/core/ - Repair: python script.py --repair -d contents/core/ ✅ Backward compatibility maintained for core workflows	2025-07-23 12:06:02 -04:00
Vijay Janapa Reddi	984cd97997	fix: Critical table extraction regression - restore ability to find all tables 🐛 CRITICAL FIX: Table extraction was broken for most tables - Before: 23/92 tables found (69 failures, 80.9% success) - After: 92/92 tables found (0 failures, 100% success) 🔧 Root cause: Regex pattern excluded ':' characters - Tables like 'Special Function Units: Details...' were rejected - Pattern stopped at first ':' because it was in exclusion list [^{{\n:]+? - Fix: Allow colons in caption text by changing to [^{{\n]+? 📊 Results across all core files: - hw_acceleration: 0→21 tables (was completely broken) - optimizations: 0→10 tables - privacy_security: 0→8 tables - frameworks: 0→6 tables - All other files: Similar dramatic improvements ✅ Perfect extraction now working: - 270 figures extracted successfully - 92 tables extracted successfully - 0 extraction failures - Ready for LLM caption improvement processing	2025-07-23 11:42:53 -04:00
Vijay Janapa Reddi	76826d2969	feat: Enhanced weak language removal with stronger replacements 🎯 Improved mid-sentence weak language detection: - Handle 'X illustrates how Y' patterns in middle of sentences - Replace with stronger constructions: 'Y through X', 'Y via X' - Avoid circular replacements (no longer use 'shows' as replacement) 💪 Stronger language replacements: - 'illustrates how' → direct restructure with stronger verbs - 'demonstrates that' → 'establishes that' / 'confirms that' - 'depicts' → 'presents' / 'exposes' - 'reveals' → 'establishes' / 'exposes' 🧪 Comprehensive testing verified: - All weak words removed from captions - No circular replacement issues - Maintains meaning while using stronger language - Proper table format and spacing preserved ✅ Real-world test case from screenshot now produces clean output: 'Each of these scenarios illustrates how...' → 'Machine learning models can serve as amplifiers through each of these scenarios'	2025-07-23 11:28:43 -04:00
Vijay Janapa Reddi	da476f48a8	fix: Resolve spacing issues in caption processing 🔧 Added comprehensive spacing normalization: - Replace multiple spaces with single space - Remove leading/trailing whitespace - Ensure single space after colons consistently 📝 Enhanced caption parsing: - More robust regex for bold: format parsing - Handle spaces around colons properly - Normalize spacing throughout processing pipeline ✅ Fixed specific issues: - No more double spaces in captions - Consistent table format: ': Bold: explanation' - Clean spacing even with malformed input - Proper handling of edge cases (missing spaces, multiple spaces) 🧪 Thoroughly tested with edge cases including: - Multiple consecutive spaces - Missing spaces after colons - Leading/trailing whitespace - Complex mixed spacing scenarios	2025-07-23 11:25:06 -04:00
Vijay Janapa Reddi	cae8d061c9	feat: Comprehensive caption quality improvements ✨ Enhanced LLM prompt: - Added explicit instructions to avoid weak sentence starters - Discourage 'Illustrates', 'Shows', 'Demonstrates' etc. - Encourage direct, strong language with examples 🔧 Post-processing improvements: - Fix capitalization after periods (handle abbreviations) - Replace weak sentence starters with direct language - Ensure proper table format with ':' prefix - Comprehensive caption validation pipeline 📝 Quality enforcement: - Automatic detection and correction of weak language - Proper sentence case throughout explanations - Standardized table caption format: ': Bold: explanation' - Word-by-word improvements while preserving meaning ✅ Fully tested with edge cases and validation	2025-07-23 11:21:35 -04:00
Vijay Janapa Reddi	dd49365d29	fix: Preserve and standardize colon prefix for table captions - Preserve existing ':' prefix in old format table captions - Add ':' prefix to new format table captions for consistency - Standardize all table captions to ': Caption {#tbl-id}' format - Tested with both old and new caption formats	2025-07-23 11:08:14 -04:00
Vijay Janapa Reddi	2ade2df23f	fix: Ensure proper line breaks after table captions during updates - Add line break preservation logic to table caption replacement - Handle problematic case where content is stuck to caption line - Force line break insertion between caption and following content - Update TikZ figure caption replacement to preserve line breaks - Tested with problematic cases to ensure proper formatting	2025-07-23 10:50:12 -04:00
Vijay Janapa Reddi	6be54008b3	feat: Add retry logic and improve sentence case formatting - Implement 3-retry logic with exponential backoff (2s, 4s, 8s) - Smart retry only for recoverable errors (API/network, not content issues) - Enhanced sentence case formatting with comprehensive technical term preservation - Preserve spaces and punctuation correctly during caption formatting - Support for both fast models (qwen2.5:7b) and large models (gemma3:27b) - Robust error handling for production caption improvement workflow	2025-07-22 21:06:54 -04:00
Vijay Janapa Reddi	67348987b4	Remove llm_experiments directory	2025-07-22 18:46:33 -04:00
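
Several of the 2025-07-23 commits above (51ccddc37e, b5a97a83b9, eec683053f) describe the same rule for table captions: strip whatever leading colons are present, then add back exactly one ': ' prefix. The sketch below is one minimal way to express that rule. It is not the code in improve_figure_captions.py; the function name `normalize_table_caption` and the regex are assumptions chosen for illustration, and only the input/output pairs come from the commit messages.

```python
import re


def normalize_table_caption(caption: str) -> str:
    """Return the caption with exactly one ': ' table prefix.

    Hypothetical helper: the name and regex are illustrative and are
    not taken from the repository.
    """
    # Remove any run of leading colons and whitespace (': :', '::', ': ').
    body = re.sub(r'^[:\s]+', '', caption.strip())
    # Re-attach a single table prefix.
    return f': {body}'


if __name__ == "__main__":
    # Edge cases listed in commit b5a97a83b9.
    for raw in (': :AI Evolution: text',
                '::AI Evolution: text',
                'AI Evolution: text',
                ': AI Evolution: text'):
        print(repr(raw), '->', repr(normalize_table_caption(raw)))
```

Because the prefix is stripped before it is re-added, the function is idempotent: running it on already-normalized output leaves the caption unchanged, which is the property the double-colon fixes were after.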


321 Commits
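
Commit 6be54008b3 in the log above mentions "3-retry logic with exponential backoff (2s, 4s, 8s)" applied only to recoverable errors. A minimal sketch of that pattern follows, assuming a hypothetical `RecoverableError` class stands in for API and network failures. The names `call_with_retry`, `RecoverableError`, and the injectable `sleep` parameter are illustrative and do not come from the repository.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RecoverableError(Exception):
    """Stand-in for API/network failures worth retrying (hypothetical)."""


def call_with_retry(fn: Callable[[], T],
                    retries: int = 3,
                    base_delay: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying up to `retries` times on RecoverableError.

    Delays double each time: base_delay, 2*base_delay, 4*base_delay, ...
    Any other exception type propagates immediately without a retry.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except RecoverableError:
            if attempt == retries:
                raise  # retries exhausted: surface the last failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real waiting; with the defaults, three failed attempts produce waits of 2, 4, and 8 seconds before the final attempt.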