# Scripts Directory
This directory contains various Python scripts used for book maintenance and processing.
## Available Scripts

### Figure Caption Improvement
The `improve_figure_captions.py` script provides automated caption enhancement using local Ollama LLM models:

```bash
# Improve all captions (recommended)
python3 scripts/improve_figure_captions.py -d contents/core/

# Analysis and utilities
python3 scripts/improve_figure_captions.py --analyze -d contents/core/
python3 scripts/improve_figure_captions.py --build-map -d contents/core/
```
📖 **Full documentation:** See `FIGURE_CAPTIONS.md` for the complete usage guide, model selection, and troubleshooting.
### Cross-Reference Generation

The `cross_refs/` directory contains scripts for generating AI-powered cross-references with explanations.

📖 **Full documentation:** See `cross_refs/RECIPE.md` for the complete workflow.
## Python Dependencies

All Python dependencies are managed through the root-level `requirements.txt` file. This ensures consistent package versions across all scripts and the GitHub Actions workflow.
### Adding New Dependencies

When adding new Python scripts that require external packages:

- Add the required packages to `requirements.txt` at the project root
- Include version constraints where appropriate (e.g., `>=1.0.0`)
- Add comments to group related packages (see the sketch below)
- Test locally with `pip install -r requirements.txt`
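As a sketch, a grouped entry with version constraints might look like the following. The package names come from the current dependency list, but the pins shown here are illustrative assumptions, not the project's actual constraints:

```
# Document processing
pybtex>=0.24.0
pypandoc>=1.11
pyyaml>=6.0

# Validation
jsonschema>=4.17.0
```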
### Current Dependencies

The current dependencies include the following; a quick import sanity check is sketched after the list:

- **Quarto/Jupyter:** `jupyterlab-quarto`, `jupyter`
- **NLP:** `nltk` (with stopwords and punkt data)
- **AI Integration:** `openai`, `gradio`
- **Document Processing:** `pybtex`, `pypandoc`, `pyyaml`
- **Image Processing:** `Pillow`
- **Validation:** `jsonschema`
- **Utilities:** `absl-py`
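Note that several pip distribution names differ from the names you import (`pyyaml` imports as `yaml`, `Pillow` as `PIL`, `absl-py` as `absl`). A minimal sanity check, assuming the root `requirements.txt` has been installed:

```python
# Confirm the core dependencies are installed and importable.
# Some pip distribution names differ from their import names:
# pyyaml -> yaml, Pillow -> PIL, absl-py -> absl.
import importlib

for module in ["nltk", "openai", "gradio", "pybtex",
               "pypandoc", "yaml", "PIL", "jsonschema", "absl"]:
    importlib.import_module(module)

print("All core dependencies import cleanly.")
```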
### Subdirectory Requirements Files

Some subdirectories have their own `requirements.txt` files for specific workflows:

- `scripts/genai/requirements.txt` - AI-specific dependencies
- `scripts/quarto_publish/requirements.txt` - Publishing dependencies

These are kept for reference, but the main workflow uses the root `requirements.txt`.
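If you do need one of these specialized workflows in isolation, the subdirectory file installs the same way:

```bash
# Install only the AI-specific dependencies
pip install -r scripts/genai/requirements.txt
```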
## GitHub Actions Integration

The GitHub Actions workflow automatically:

- Caches Python packages for faster builds
- Installs all dependencies from `requirements.txt`
- Downloads required NLTK data
- Reports cache status in build summaries

The cache is invalidated when `requirements.txt` changes, ensuring dependencies stay up to date; a sketch of this pattern follows below.
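The actual workflow file is not reproduced here; a minimal sketch of the caching and install steps described above, using the standard `setup-python` pip cache support, might look like:

```yaml
# Illustrative excerpt only -- the real workflow under .github/workflows/
# may differ in structure, action versions, and Python version.
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with:
      python-version: "3.11"   # assumed version
      cache: "pip"             # keyed on requirements.txt, so edits invalidate it
  - run: pip install -r requirements.txt
  - run: python -m nltk.downloader stopwords punkt
```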
## Pre-commit Setup

The project uses pre-commit hooks for code quality checks. The hooks run automatically on commit and include:

- Spell checking with codespell (see the sketch below)
- YAML validation for `_quarto.yml`
- Markdown formatting and linting
- Bibliography formatting with bibtex-tidy
- Custom Python scripts for section ID management and unreferenced-label detection
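Pre-commit reads its hook list from `.pre-commit-config.yaml` at the repository root. A hedged sketch of how the codespell hook is typically declared (the revision pin is an assumption, not the project's actual config):

```yaml
# Illustrative excerpt -- check the repository's .pre-commit-config.yaml
# for the actual hooks and pinned revisions.
repos:
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.6   # assumed pin
    hooks:
      - id: codespell
```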
### Setup Instructions

1. Install pre-commit (included in `requirements.txt`):

   ```bash
   pip install -r requirements.txt
   ```

2. Install the git hooks:

   ```bash
   pre-commit install
   ```

3. Run manually (optional):

   ```bash
   # Run on all files
   pre-commit run --all-files

   # Run on specific files
   pre-commit run --files path/to/file.qmd
   ```
### Troubleshooting

- **NLTK data issues:** The hooks automatically download the required NLTK data, but if you encounter issues, you can manually run:

  ```python
  import nltk
  nltk.download('stopwords')
  nltk.download('punkt')
  ```

- **Python environment:** The hooks use isolated Python environments with the specified dependencies, so they should work regardless of your local Python setup.