# Scripts Directory
This directory contains various Python scripts used for book maintenance and processing.
## Available Scripts
### Figure Caption Improvement
The `improve_figure_captions.py` script provides automated caption enhancement using local Ollama LLM models:
```bash
# Improve all captions (recommended)
python3 scripts/improve_figure_captions.py -d contents/core/
# Analysis and utilities
python3 scripts/improve_figure_captions.py --analyze -d contents/core/
python3 scripts/improve_figure_captions.py --build-map -d contents/core/
```
📖 **Full documentation**: See [`FIGURE_CAPTIONS.md`](FIGURE_CAPTIONS.md) for complete usage guide, model selection, and troubleshooting.
### Cross-Reference Generation
The `cross_refs/` directory contains scripts for generating AI-powered cross-references with explanations.
📖 **Full documentation**: See [`cross_refs/RECIPE.md`](cross_refs/RECIPE.md) for the complete workflow.
## Python Dependencies
All Python dependencies are managed through the root-level `requirements.txt` file. This ensures consistent package versions across all scripts and the GitHub Actions workflow.
### Adding New Dependencies
When adding new Python scripts that require external packages:
1. Add the required packages to `requirements.txt` at the project root
2. Include version constraints where appropriate (e.g., `>=1.0.0`)
3. Add comments to group related packages
4. Test locally with: `pip install -r requirements.txt`
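As a concrete sketch of steps 1–3 (the package name and comment are illustrative, not actual project dependencies):

```shell
# Hypothetical example: adding "tabulate" as a new dependency.
# Steps 1-3: append it under a grouping comment, with a version constraint.
cat >> requirements.txt <<'EOF'

# Table rendering (hypothetical example package)
tabulate>=0.9.0
EOF

# Step 4: test locally that the full set still installs:
#   pip install -r requirements.txt
```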
### Current Dependencies
The current dependencies include:
- **Quarto/Jupyter**: `jupyterlab-quarto`, `jupyter`
- **NLP**: `nltk` (with stopwords and punkt data)
- **AI Integration**: `openai`, `gradio`
- **Document Processing**: `pybtex`, `pypandoc`, `pyyaml`
- **Image Processing**: `Pillow`
- **Validation**: `jsonschema`
- **Utilities**: `absl-py`
### Subdirectory Requirements Files
Some subdirectories have their own `requirements.txt` files for specific workflows:
- `scripts/genai/requirements.txt` - AI-specific dependencies
- `scripts/publish/requirements.txt` - Publishing dependencies
These are kept for reference, but the main workflow uses the root `requirements.txt`.
### GitHub Actions Integration
The GitHub Actions workflow automatically:
1. Caches Python packages for faster builds
2. Installs all dependencies from `requirements.txt`
3. Downloads required NLTK data
4. Reports cache status in build summaries
The cache is invalidated whenever `requirements.txt` changes, so dependencies stay up to date.
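The caching steps might look like the following workflow fragment (a sketch, not the actual workflow file; the Python version is an assumption, and `actions/setup-python` with `cache: "pip"` keys its cache on `requirements.txt` by default):

```yaml
# Sketch of the relevant workflow steps (actual workflow may differ)
- uses: actions/setup-python@v5
  with:
    python-version: "3.11"   # assumed version
    cache: "pip"             # cache keyed on requirements.txt by default
- run: pip install -r requirements.txt
- run: python -m nltk.downloader stopwords punkt
```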
## Pre-commit Setup
The project uses pre-commit hooks for code quality checks. The hooks run automatically on commit and include:
- **Spell checking** with codespell
- **YAML validation** for `_quarto-html.yml` and `_quarto-pdf.yml`
- **Markdown formatting** and linting
- **Bibliography formatting** with bibtex-tidy
- **Custom Python scripts** for section ID management and unreferenced label detection
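Hooks like these are declared in `.pre-commit-config.yaml`. A minimal sketch of what two of the entries above could look like (the `rev` pin, local hook id, and script path are placeholders, not the project's actual configuration):

```yaml
# Sketch of .pre-commit-config.yaml entries (names and revs are assumptions)
repos:
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.6               # placeholder revision
    hooks:
      - id: codespell
  - repo: local
    hooks:
      - id: section-id-check  # hypothetical local hook
        name: Check section IDs
        entry: python3 scripts/check_section_ids.py  # hypothetical script path
        language: python
        files: \.qmd$
```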
### Setup Instructions
1. **Install pre-commit** (included in `requirements.txt`):
```bash
pip install -r requirements.txt
```
2. **Install the git hooks**:
```bash
pre-commit install
```
3. **Run manually** (optional):
```bash
# Run on all files
pre-commit run --all-files
# Run on specific files
pre-commit run --files path/to/file.qmd
```
### Troubleshooting
- **NLTK data issues**: The hooks automatically download required NLTK data, but if you encounter issues, you can manually run:
```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
```
- **Python environment**: The hooks use isolated Python environments with the specified dependencies, so they should work regardless of your local Python setup.