Vijay Janapa Reddi acadb19572 🚀 Optimize filtering for better cross-reference coverage
 Improved Section Extraction:
- Updated filters.yml to be less aggressive on substantial content
- Removed exclusions for 'overview', 'introduction', 'conclusion' sections
- These often contain valuable technical content, not just meta-content
- Kept exclusions for truly meta content like 'purpose', 'learning objectives'

 Relaxed Content Filters:
- Min length: 200 → 150 chars (allow shorter sections)
- Max length: 15000 → 20000 chars (allow longer sections)
- List ratio: 70% → 80% (allow more list-heavy content)
- Code ratio: 80% → 90% (allow more code examples)
- Citation ratio: 30% → 40% (allow more referenced content)

 Results with Domain-Adapted Model:
- Section extraction: 52 → 74 sections (42% improvement)
- Cross-references: 30 → 63 references (110% improvement)
- File coverage: 8 → 13 files (62% more files connected)
- Quality maintained: 65.6% average similarity

 Optimal Settings Identified:
- Similarity threshold: 0.6 (vs default 0.65)
- Max suggestions: 3 per section
- Balances quantity and quality effectively

This version provides much better coverage while maintaining high-quality
cross-references between legitimate technical sections.
2025-07-21 16:08:27 -04:00

Scripts Directory

This directory contains the Python scripts used to maintain and process the book.

Python Dependencies

All Python dependencies are managed through the root-level requirements.txt file. This ensures consistent package versions across all scripts and the GitHub Actions workflow.

Adding New Dependencies

When adding new Python scripts that require external packages:

  1. Add the required packages to requirements.txt at the project root
  2. Include version constraints where appropriate (e.g., >=1.0.0)
  3. Add comments to group related packages
  4. Test locally with: pip install -r requirements.txt
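Before committing a change to requirements.txt, it can help to see which entries carry a version constraint and which do not. The following is a minimal sketch, not part of the repository: `parse_requirements`, `unpinned`, and the `SAMPLE` contents are hypothetical, and the parser handles only simple `name` or `name>=version` lines (no `-r` includes, extras, or URLs).

```python
import re

# Matches a distribution name followed by an optional version specifier,
# e.g. "pyyaml>=1.0.0" -> ("pyyaml", ">=1.0.0"). Deliberately simplified.
_REQ_LINE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


def parse_requirements(text):
    """Return (name, specifier) pairs, skipping comments and blank lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and inline comments
        if not line:
            continue
        match = _REQ_LINE.match(line)
        if match:  # lines such as "-r other.txt" do not match and are skipped
            entries.append((match.group(1), match.group(2).strip()))
    return entries


def unpinned(entries):
    """Return the names of entries that have no version constraint."""
    return [name for name, spec in entries if not spec]


# Hypothetical file contents; comments group related packages.
SAMPLE = """\
# Document Processing
pyyaml>=1.0.0
pypandoc

# Validation
jsonschema>=1.0.0
"""

entries = parse_requirements(SAMPLE)
print(unpinned(entries))
```

An entry reported by `unpinned` is not necessarily wrong, since the guidance above asks for constraints only "where appropriate"; the list is just a prompt to double-check.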

Current Dependencies

The current dependencies include:

  • Quarto/Jupyter: jupyterlab-quarto, jupyter
  • NLP: nltk (with stopwords and punkt data)
  • AI Integration: openai, gradio
  • Document Processing: pybtex, pypandoc, pyyaml
  • Image Processing: Pillow
  • Validation: jsonschema
  • Utilities: absl-py
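To confirm that a local environment actually provides these packages, the standard library's `importlib.metadata` can report the installed version of each distribution. This is a sketch under assumptions: the `DEPENDENCIES` list is copied from the bullets above rather than read from requirements.txt, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the list above (assumed to match requirements.txt).
DEPENDENCIES = [
    "jupyterlab-quarto", "jupyter", "nltk", "openai", "gradio",
    "pybtex", "pypandoc", "pyyaml", "Pillow", "jsonschema", "absl-py",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(DEPENDENCIES)
missing = [name for name, ver in report.items() if ver is None]
print("missing:", missing)
```

If `missing` is non-empty, rerunning `pip install -r requirements.txt` from the project root is the first thing to try.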

Subdirectory Requirements Files

Some subdirectories have their own requirements.txt files for specific workflows:

  • scripts/genai/requirements.txt - AI-specific dependencies
  • scripts/quarto_publish/requirements.txt - Publishing dependencies

These files are kept for reference only; the main workflow installs from the root requirements.txt.

GitHub Actions Integration

The GitHub Actions workflow automatically:

  1. Caches Python packages for faster builds
  2. Installs all dependencies from requirements.txt
  3. Downloads required NLTK data
  4. Reports cache status in build summaries

The cache is invalidated whenever requirements.txt changes, so cached packages never lag behind the declared dependencies.
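The invalidation rule can be illustrated by deriving the cache key from a hash of the requirements file, so that any edit to the file produces a new key and therefore a cache miss. This sketch only illustrates the idea; it is not the workflow's actual configuration, and `cache_key` and its `prefix` parameter are hypothetical.

```python
import hashlib
from pathlib import Path


def cache_key(requirements_path, prefix="pip"):
    """Build a cache key that changes whenever the file's bytes change."""
    digest = hashlib.sha256(Path(requirements_path).read_bytes()).hexdigest()
    # A truncated digest keeps the key short while remaining content-dependent.
    return f"{prefix}-{digest[:16]}"
```

Two runs with an identical requirements.txt produce the same key and can reuse the cache; adding, removing, or re-pinning a package changes the digest and forces a fresh install.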

Pre-commit Setup

The project uses pre-commit hooks for code quality checks. The hooks run automatically on commit and include:

  • Spell checking with codespell
  • YAML validation for _quarto.yml
  • Markdown formatting and linting
  • Bibliography formatting with bibtex-tidy
  • Custom Python scripts for section ID management and unreferenced label detection

Setup Instructions

  1. Install pre-commit (included in requirements.txt):

    pip install -r requirements.txt
    
  2. Install the git hooks:

    pre-commit install
    
  3. Run manually (optional):

    # Run on all files
    pre-commit run --all-files
    
    # Run on specific files
    pre-commit run --files path/to/file.qmd
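When the hooks need to be triggered from another Python script, the two invocations above can be wrapped in a small helper. This is a sketch, assuming `pre-commit` is on the PATH; `precommit_command` and `run_precommit` are hypothetical names, and only the `--all-files` and `--files` flags shown above are used.

```python
import subprocess


def precommit_command(files=None):
    """Build the argument list for the manual pre-commit invocations above."""
    cmd = ["pre-commit", "run"]
    if files:
        cmd += ["--files", *files]  # check only the given paths
    else:
        cmd.append("--all-files")   # check the whole repository
    return cmd


def run_precommit(files=None):
    """Run the hooks and return the exit code (nonzero when a hook fails)."""
    return subprocess.run(precommit_command(files), check=False).returncode
```

For example, `run_precommit(["path/to/file.qmd"])` mirrors the second command, while `run_precommit()` mirrors the first.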
    

Troubleshooting

  • NLTK data issues: The hooks download the required NLTK data automatically. If that download fails, fetch the data manually from a Python session:

    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
    
  • Python environment: The hooks use isolated Python environments with the specified dependencies, so they should work regardless of your local Python setup.
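A slightly more defensive variant of the manual NLTK download checks whether each resource is already present and fetches only what is missing. This is a sketch: `ensure_nltk_data` and the `RESOURCES` mapping are hypothetical, the `find` and `download` parameters exist only so the helper can be exercised without NLTK installed, and the resource paths assume NLTK's usual `corpora/` and `tokenizers/` layout.

```python
# Download name -> path that NLTK's data loader looks up (assumed layout).
RESOURCES = {
    "stopwords": "corpora/stopwords",
    "punkt": "tokenizers/punkt",
}


def ensure_nltk_data(resources=RESOURCES, find=None, download=None):
    """Download any missing resources and return the names that were fetched."""
    if find is None or download is None:
        import nltk  # imported lazily so the defaults are only needed on real runs
        find = find or nltk.data.find
        download = download or nltk.download
    fetched = []
    for name, path in resources.items():
        try:
            find(path)          # raises LookupError when the resource is absent
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched
```

Calling `ensure_nltk_data()` with no arguments uses the real NLTK functions; an empty return value means both resources were already available.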