# Scripts Directory
This directory contains various Python scripts used for book maintenance and processing.
## Available Scripts

### Figure Caption Improvement
The `improve_figure_captions.py` script provides automated caption enhancement using local Ollama LLM models:

```bash
# Improve all captions (recommended)
python3 scripts/improve_figure_captions.py -d contents/core/

# Analysis and utilities
python3 scripts/improve_figure_captions.py --analyze -d contents/core/
python3 scripts/improve_figure_captions.py --build-map -d contents/core/
```
📖 **Full documentation:** See `FIGURE_CAPTIONS.md` for the complete usage guide, model selection, and troubleshooting.
### Cross-Reference Generation

The `cross_refs/` directory contains scripts for generating AI-powered cross-references with explanations.

📖 **Full documentation:** See `cross_refs/RECIPE.md` for the complete workflow.
## Python Dependencies

All Python dependencies are managed through the root-level `requirements.txt` file. This ensures consistent package versions across all scripts and the GitHub Actions workflow.
### Adding New Dependencies

When adding new Python scripts that require external packages:

- Add the required packages to `requirements.txt` at the project root
- Include version constraints where appropriate (e.g., `>=1.0.0`)
- Add comments to group related packages (see the sketch below)
- Test locally with `pip install -r requirements.txt`
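As a sketch, a grouped entry with version constraints might look like the following. The package names come from the current dependency list, but the pins shown here are illustrative assumptions, not the project's actual constraints:

```
# Document processing
pybtex>=0.24.0
pypandoc>=1.11
pyyaml>=6.0

# Validation
jsonschema>=4.17.0
```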
### Current Dependencies

The current dependencies include the following; a quick import sanity check is sketched after the list:

- **Quarto/Jupyter:** `jupyterlab-quarto`, `jupyter`
- **NLP:** `nltk` (with stopwords and punkt data)
- **AI Integration:** `openai`, `gradio`
- **Document Processing:** `pybtex`, `pypandoc`, `pyyaml`
- **Image Processing:** `Pillow`
- **Validation:** `jsonschema`
- **Utilities:** `absl-py`
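Note that several pip distribution names differ from the names you import (`pyyaml` imports as `yaml`, `Pillow` as `PIL`, `absl-py` as `absl`). A minimal sanity check, assuming the root `requirements.txt` has been installed:

```python
# Confirm the core dependencies are installed and importable.
# Some pip distribution names differ from their import names:
# pyyaml -> yaml, Pillow -> PIL, absl-py -> absl.
import importlib

for module in ["nltk", "openai", "gradio", "pybtex",
               "pypandoc", "yaml", "PIL", "jsonschema", "absl"]:
    importlib.import_module(module)

print("All core dependencies import cleanly.")
```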
### Subdirectory Requirements Files

Some subdirectories have their own `requirements.txt` files for specific workflows:

- `scripts/genai/requirements.txt` - AI-specific dependencies
- `scripts/quarto_publish/requirements.txt` - Publishing dependencies

These are kept for reference, but the main workflow uses the root `requirements.txt`.
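If you do need one of these specialized workflows in isolation, the subdirectory file installs the same way:

```bash
# Install only the AI-specific dependencies
pip install -r scripts/genai/requirements.txt
```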
## GitHub Actions Integration

The GitHub Actions workflow automatically:

- Caches Python packages for faster builds
- Installs all dependencies from `requirements.txt`
- Downloads required NLTK data
- Reports cache status in build summaries

The cache is invalidated when `requirements.txt` changes, ensuring dependencies stay up to date; a sketch of this pattern follows below.
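The actual workflow file is not reproduced here; a minimal sketch of the caching and install steps described above, using the standard `setup-python` pip cache support, might look like:

```yaml
# Illustrative excerpt only -- the real workflow under .github/workflows/
# may differ in structure, action versions, and Python version.
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with:
      python-version: "3.11"   # assumed version
      cache: "pip"             # keyed on requirements.txt, so edits invalidate it
  - run: pip install -r requirements.txt
  - run: python -m nltk.downloader stopwords punkt
```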
## Pre-commit Setup

The project uses pre-commit hooks for code quality checks. The hooks run automatically on commit and include:

- Spell checking with codespell (see the sketch below)
- YAML validation for `_quarto.yml`
- Markdown formatting and linting
- Bibliography formatting with bibtex-tidy
- Custom Python scripts for section ID management and unreferenced-label detection
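Pre-commit reads its hook list from `.pre-commit-config.yaml` at the repository root. A hedged sketch of how the codespell hook is typically declared (the revision pin is an assumption, not the project's actual config):

```yaml
# Illustrative excerpt -- check the repository's .pre-commit-config.yaml
# for the actual hooks and pinned revisions.
repos:
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.6   # assumed pin
    hooks:
      - id: codespell
```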
### Setup Instructions

1. Install pre-commit (included in `requirements.txt`):

   ```bash
   pip install -r requirements.txt
   ```

2. Install the git hooks:

   ```bash
   pre-commit install
   ```

3. Run manually (optional):

   ```bash
   # Run on all files
   pre-commit run --all-files

   # Run on specific files
   pre-commit run --files path/to/file.qmd
   ```
### Troubleshooting

- **NLTK data issues:** The hooks automatically download the required NLTK data, but if you encounter issues, you can manually run:

  ```python
  import nltk
  nltk.download('stopwords')
  nltk.download('punkt')
  ```

- **Python environment:** The hooks use isolated Python environments with the specified dependencies, so they should work regardless of your local Python setup.