# Scripts Directory
This directory contains various Python scripts used for book maintenance and processing.
## Available Scripts
### Figure Caption Improvement
The `improve_figure_captions.py` script provides automated caption enhancement using local Ollama LLM models:
```bash
# Improve all captions (recommended)
python3 scripts/improve_figure_captions.py -d contents/core/
# Analysis and utilities
python3 scripts/improve_figure_captions.py --analyze -d contents/core/
python3 scripts/improve_figure_captions.py --build-map -d contents/core/
```
📖 **Full documentation**: See [`FIGURE_CAPTIONS.md`](FIGURE_CAPTIONS.md) for complete usage guide, model selection, and troubleshooting.
### Cross-Reference Generation
The `cross_refs/` directory contains scripts for generating AI-powered cross-references with explanations.
📖 **Full documentation**: See [`cross_refs/RECIPE.md`](cross_refs/RECIPE.md) for the complete workflow.
## Python Dependencies
All Python dependencies are managed through the root-level `requirements.txt` file. This ensures consistent package versions across all scripts and the GitHub Actions workflow.
### Adding New Dependencies
When adding new Python scripts that require external packages:
1. Add the required packages to `requirements.txt` at the project root
2. Include version constraints where appropriate (e.g., `>=1.0.0`)
3. Add comments to group related packages
4. Test locally with: `pip install -r requirements.txt`
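As a concrete sketch of steps 1–3 (the package name and comment are illustrative, not actual project dependencies):

```shell
# Hypothetical example: adding "tabulate" as a new dependency.
# Steps 1-3: append it under a grouping comment, with a version constraint.
cat >> requirements.txt <<'EOF'

# Table rendering (hypothetical example package)
tabulate>=0.9.0
EOF

# Step 4: test locally that the full set still installs:
#   pip install -r requirements.txt
```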
### Current Dependencies
The current dependencies include:
- **Quarto/Jupyter**: `jupyterlab-quarto`, `jupyter`
- **NLP**: `nltk` (with stopwords and punkt data)
- **AI Integration**: `openai`, `gradio`
- **Document Processing**: `pybtex`, `pypandoc`, `pyyaml`
- **Image Processing**: `Pillow`
- **Validation**: `jsonschema`
- **Utilities**: `absl-py`
### Subdirectory Requirements Files
Some subdirectories have their own `requirements.txt` files for specific workflows:
- `scripts/genai/requirements.txt` - AI-specific dependencies
- `scripts/publish/requirements.txt` - Publishing dependencies
These are kept for reference, but the main workflow uses the root `requirements.txt`.
### GitHub Actions Integration
The GitHub Actions workflow automatically:
1. Caches Python packages for faster builds
2. Installs all dependencies from `requirements.txt`
3. Downloads required NLTK data
4. Reports cache status in build summaries
The cache is invalidated whenever `requirements.txt` changes, so dependencies stay up to date.
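The caching steps might look like the following workflow fragment (a sketch, not the actual workflow file; the Python version is an assumption, and `actions/setup-python` with `cache: "pip"` keys its cache on `requirements.txt` by default):

```yaml
# Sketch of the relevant workflow steps (actual workflow may differ)
- uses: actions/setup-python@v5
  with:
    python-version: "3.11"   # assumed version
    cache: "pip"             # cache keyed on requirements.txt by default
- run: pip install -r requirements.txt
- run: python -m nltk.downloader stopwords punkt
```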
## Pre-commit Setup
The project uses pre-commit hooks for code quality checks. The hooks run automatically on commit and include:
- **Spell checking** with codespell
- **YAML validation** for `_quarto-html.yml` and `_quarto-pdf.yml`
- **Markdown formatting** and linting
- **Bibliography formatting** with bibtex-tidy
- **Custom Python scripts** for section ID management and unreferenced label detection
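Hooks like these are declared in `.pre-commit-config.yaml`. A minimal sketch of what two of the entries above could look like (the `rev` pin, local hook id, and script path are placeholders, not the project's actual configuration):

```yaml
# Sketch of .pre-commit-config.yaml entries (names and revs are assumptions)
repos:
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.6               # placeholder revision
    hooks:
      - id: codespell
  - repo: local
    hooks:
      - id: section-id-check  # hypothetical local hook
        name: Check section IDs
        entry: python3 scripts/check_section_ids.py  # hypothetical script path
        language: python
        files: \.qmd$
```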
### Setup Instructions
1. **Install pre-commit** (included in `requirements.txt`):
```bash
pip install -r requirements.txt
```
2. **Install the git hooks**:
```bash
pre-commit install
```
3. **Run manually** (optional):
```bash
# Run on all files
pre-commit run --all-files
# Run on specific files
pre-commit run --files path/to/file.qmd
```
### Troubleshooting
- **NLTK data issues**: The hooks automatically download required NLTK data, but if you encounter issues, you can manually run:
```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
```
- **Python environment**: The hooks use isolated Python environments with the specified dependencies, so they should work regardless of your local Python setup.