
LLM Cross-Reference Explanation Optimization Experiments

This framework systematically tests and evaluates different LLMs and explanation-length targets for generating cross-reference explanations in the ML Systems textbook.

🎯 What This Tests

Model Comparison

  • Multiple Ollama Models: Tests available models (qwen2.5, llama3.1, mistral, gemma2, etc.)
  • Consistent Evaluation: All models tested on identical test cases
  • Performance Metrics: Comprehensive scoring across 6 criteria

Length Optimization

  • 5 Length Targets: ultra_short (3-5 words) → extended (10-15 words)
  • Quality vs Brevity: Finds the optimal balance for limited margin space
  • Adherence Tracking: Monitors whether models follow the length constraints
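
For reference, the five targets follow the same shape as the LENGTH_TARGETS entries shown later in this README. Only ultra_short, medium, and extended are named in this document; the intermediate names and ranges below are illustrative assumptions:

LENGTH_TARGETS = [
    {"min_words": 3,  "max_words": 5,  "description": "ultra_short"},
    {"min_words": 4,  "max_words": 6,  "description": "short"},    # hypothetical
    {"min_words": 5,  "max_words": 8,  "description": "medium"},   # range assumed
    {"min_words": 8,  "max_words": 10, "description": "long"},     # hypothetical
    {"min_words": 10, "max_words": 15, "description": "extended"},
]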

LLM-as-Judge Evaluation

  • 6 Evaluation Criteria:
    • Relevance: Captures actual relationship between sections
    • Clarity: Clear and understandable for students
    • Conciseness: Appropriate length without verbosity
    • Usefulness: Helps readers decide to follow the link
    • Accuracy: Factually correct about content domains
    • Uniqueness: Adds value beyond section titles
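
A minimal sketch of how these six scores might be combined into a single overall score (the equal weighting here is an assumption, not necessarily what llm_judge.py does):

CRITERIA = ["relevance", "clarity", "conciseness",
            "usefulness", "accuracy", "uniqueness"]

def overall_score(scores: dict) -> float:
    """Average the six 1-10 criterion scores into one overall score."""
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)

print(overall_score({"relevance": 9, "clarity": 8, "conciseness": 7,
                     "usefulness": 8, "accuracy": 9, "uniqueness": 6}))  # ≈ 7.83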

📁 Framework Components

scripts/llm_experiments/
├── test_cases.py          # 8 realistic cross-reference test cases
├── llm_judge.py           # LLM-based evaluation system
├── experiment_runner.py   # Main orchestration system
├── run_experiments.py     # Automated runner script
├── results/               # Experiment outputs (JSON files)
└── README.md              # This documentation

🚀 Quick Start

Prerequisites

  1. Ollama installed and running
  2. At least one model pulled (recommended: qwen2.5:7b, qwen2.5:32b)
  3. Python package requests (already in the project's requirements)

Run Experiments

cd scripts/llm_experiments
python3 run_experiments.py

The script will:

  1. 🔍 Check available Ollama models
  2. 🧪 Test each model on standardized test cases
  3. 📏 Optimize explanation length with best model
  4. 📊 Generate data-driven recommendations
  5. 💾 Save detailed results to results/ directory

Expected Duration: 30-60 minutes, depending on the number and size of available models

📊 Understanding Results

Key Output Files

  • recommendations_latest.json: Main recommendations and analysis
  • model_comparison_latest.json: Detailed model performance data
  • length_optimization_latest.json: Optimal length analysis

Sample Recommendation Output

{
  "recommendations": {
    "model": {
      "recommended": "qwen2.5:14b",
      "confidence": "high",
      "reasoning": "qwen2.5:14b significantly outperforms other models with 8.2 average score vs 6.8 for worst model"
    },
    "length": {
      "recommended": "medium",
      "reasoning": "Length target 'medium' achieved highest score of 8.1 with 7.8 average words"
    }
  }
}
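
To consume these values from another script, the file can be read with the standard json module (the path and key names follow the sample above):

import json

with open("results/recommendations_latest.json") as f:
    data = json.load(f)

model_rec = data["recommendations"]["model"]
length_rec = data["recommendations"]["length"]
print(f"Use model {model_rec['recommended']} (confidence: {model_rec['confidence']})")
print(f"Target length: {length_rec['recommended']}")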

🧪 Test Cases Overview

The framework uses 8 carefully designed test cases covering:

  1. Introductory Connections: AI Pervasiveness → Neural Networks
  2. Technical Depth: Training → Hardware Acceleration
  3. Advanced Topics: Adversarial Attacks → Privacy
  4. Practical Applications: Frameworks → Deployment
  5. Backward References: Optimization → Training Fundamentals
  6. Complex Technical: Transformers → Efficient Attention
  7. Real-world Applications: Edge Computing → Deployment
  8. Short Content: CNN Basics → Image Classification

Each test case includes realistic content excerpts and represents different difficulty levels and domains.

🔬 Methodology

Model Testing Process

  1. Generate explanations using each available model
  2. Evaluate with an LLM judge (a powerful model such as qwen2.5:32b)
  3. Score across 6 criteria (1-10 scale)
  4. Calculate statistics (mean, median, std dev)
  5. Rank models by overall performance
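
Step 4 needs nothing beyond the standard library; a minimal sketch, assuming each model's overall scores are collected in a list:

from statistics import mean, median, stdev

scores = [8.2, 7.9, 8.5, 7.4, 8.1]  # hypothetical overall scores for one model
print(f"mean={mean(scores):.2f}  median={median(scores):.2f}  stdev={stdev(scores):.2f}")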

Length Optimization Process

  1. Use best-performing model from comparison phase
  2. Test 5 length targets on diverse test cases
  3. Measure quality vs length trade-offs
  4. Check adherence to length constraints
  5. Recommend optimal range
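
The adherence check in step 4 reduces to a word count against the target range; a sketch using the LENGTH_TARGETS field names (example values assumed):

def adheres(explanation: str, target: dict) -> bool:
    """True if the explanation's word count falls inside the target range."""
    n = len(explanation.split())
    return target["min_words"] <= n <= target["max_words"]

print(adheres("Shows how training maps onto hardware accelerators",
              {"min_words": 5, "max_words": 8, "description": "medium"}))  # True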

Evaluation Reliability

  • Low temperature (0.1) for consistent judge scoring
  • Multiple test cases per condition for statistical validity
  • Retry logic for network reliability
  • Comprehensive criteria covering all important aspects
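
A sketch of the request pattern these bullets describe, using Ollama's standard /api/generate HTTP endpoint; the framework's own version lives in _make_ollama_request() and may differ:

import time
import requests

def generate(model: str, prompt: str, retries: int = 3) -> str:
    """Call Ollama with low temperature, retrying on transient failures."""
    for attempt in range(retries):
        try:
            resp = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": model, "prompt": prompt, "stream": False,
                      "options": {"temperature": 0.1}},  # low temp for consistency
                timeout=120,
            )
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying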

📈 Expected Outcomes

The experiments will determine:

  1. Best Model: Which Ollama model generates highest-quality explanations
  2. Optimal Length: Sweet spot between informativeness and conciseness
  3. Performance Gaps: How much difference model choice makes
  4. Length Sensitivity: How explanation length affects quality
  5. Deployment Recommendations: Data-driven guidance for production

🛠️ Customization

Adding New Models

Edit experiment_runner.py:

self.test_models = [
    "qwen2.5:7b",
    "your-new-model:version",  # Add here
    # ... existing models
]

Adding Test Cases

Edit test_cases.py:

TEST_CASES.append({
    "id": "your_test_case",
    "source_title": "Source Section",
    "source_content": "Content...",
    "target_title": "Target Section", 
    "target_content": "Content...",
    "connection_type": "Preview",
    "domain": "your_domain",
    "difficulty": "intermediate"
})

Adjusting Length Targets

Edit test_cases.py:

LENGTH_TARGETS.append({
    "min_words": 5, 
    "max_words": 8, 
    "description": "custom_length"
})

🚨 Troubleshooting

Common Issues

No models available

ollama list                    # Check installed models
ollama pull qwen2.5:7b         # Install a model
ollama serve                   # Start Ollama daemon
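
The same check can be done from Python via the daemon's /api/tags endpoint (standard Ollama HTTP API, assuming the default port):

import requests

tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
print([m["name"] for m in tags.get("models", [])])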

Import errors

cd scripts/llm_experiments
python3 -c "import requests; print('✅ OK')"

Slow performance

  • Use smaller models for faster testing
  • Reduce test cases in run_model_comparison_experiment()
  • Increase timeouts in _make_ollama_request()

Debug Mode

For detailed debugging, run individual components:

from experiment_runner import ExperimentRunner
runner = ExperimentRunner()
models = runner.check_available_models()
print(f"Available models: {models}")

📝 Next Steps After Experiments

  1. Review the recommendations in recommendations_latest.json
  2. Update cross_refs.py to use the recommended model
  3. Adjust the generation prompt to target the recommended explanation length
  4. Test on real data with a small batch
  5. Deploy to production if the results are satisfactory

The framework provides a data-driven foundation for making informed decisions about model selection and explanation-generation parameters.