LLM Cross-Reference Explanation Optimization Experiments
This framework systematically tests and evaluates different LLMs and explanation lengths for generating cross-reference explanations in the ML Systems textbook.
🎯 What This Tests
Model Comparison
- Multiple Ollama Models: Tests available models (qwen2.5, llama3.1, mistral, gemma2, etc.)
- Consistent Evaluation: All models tested on identical test cases
- Performance Metrics: Comprehensive scoring across 6 criteria
Length Optimization
- 5 Length Targets: ultra_short (3-5 words) → extended (10-15 words); see the sketch after this list
- Quality vs Brevity: Finds optimal balance for margin space
- Adherence Tracking: Monitors if models follow length constraints
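For orientation, a sketch of the LENGTH_TARGETS structure used in test_cases.py (see Customization below). Only ultra_short and extended are documented here; the intermediate names and word ranges are assumptions for illustration.

```python
# Illustrative only: the real LENGTH_TARGETS live in test_cases.py.
# Intermediate entries below are assumed, not documented.
LENGTH_TARGETS = [
    {"min_words": 3,  "max_words": 5,  "description": "ultra_short"},
    {"min_words": 5,  "max_words": 7,  "description": "short"},     # assumed
    {"min_words": 7,  "max_words": 9,  "description": "medium"},    # assumed
    {"min_words": 9,  "max_words": 12, "description": "long"},      # assumed
    {"min_words": 10, "max_words": 15, "description": "extended"},
]
```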
LLM-as-Judge Evaluation
- 6 Evaluation Criteria (see the judge-prompt sketch after this list):
- Relevance: Captures actual relationship between sections
- Clarity: Clear and understandable for students
- Conciseness: Appropriate length without verbosity
- Usefulness: Helps readers decide to follow the link
- Accuracy: Factually correct about content domains
- Uniqueness: Adds value beyond section titles
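A minimal sketch of how a judge prompt over these six criteria could be assembled. The actual prompt lives in llm_judge.py and may differ; build_judge_prompt is a hypothetical helper name.

```python
# Hypothetical helper; the real prompt construction is in llm_judge.py.
CRITERIA = ["relevance", "clarity", "conciseness", "usefulness", "accuracy", "uniqueness"]

def build_judge_prompt(source_title: str, target_title: str, explanation: str) -> str:
    return (
        f"A cross-reference links '{source_title}' to '{target_title}' with the "
        f"explanation: \"{explanation}\".\n"
        f"Score it from 1 (poor) to 10 (excellent) on each criterion: {', '.join(CRITERIA)}.\n"
        'Respond with JSON only, e.g. {"relevance": 8, "clarity": 7, ...}.'
    )
```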
📁 Framework Components
scripts/llm_experiments/
├── test_cases.py # 8 realistic cross-reference test cases
├── llm_judge.py # LLM-based evaluation system
├── experiment_runner.py # Main orchestration system
├── run_experiments.py # Automated runner script
├── results/ # Experiment outputs (JSON files)
└── README.md # This documentation
🚀 Quick Start
Prerequisites
- Ollama installed and running
- At least one model pulled (recommended: qwen2.5:7b, qwen2.5:32b)
- Python packages: requests (already in requirements)
Run Experiments
cd scripts/llm_experiments
python3 run_experiments.py
The script will:
- ✅ Check available Ollama models
- 🧪 Test each model on standardized test cases
- 📏 Optimize explanation length with best model
- 📊 Generate data-driven recommendations
- 💾 Save detailed results to the results/ directory
Expected Duration: 30-60 minutes depending on available models
📊 Understanding Results
Key Output Files
- recommendations_latest.json: Main recommendations and analysis
- model_comparison_latest.json: Detailed model performance data
- length_optimization_latest.json: Optimal length analysis
Sample Recommendation Output
{
"recommendations": {
"model": {
"recommended": "qwen2.5:14b",
"confidence": "high",
"reasoning": "qwen2.5:14b significantly outperforms other models with 8.2 average score vs 6.8 for worst model"
},
"length": {
"recommended": "medium",
"reasoning": "Length target 'medium' achieved highest score of 8.1 with 7.8 average words"
}
}
}
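After a run, the recommendation can also be pulled out programmatically; a minimal sketch, assuming the JSON layout shown above:

```python
import json

# Load the most recent recommendations written by run_experiments.py.
with open("results/recommendations_latest.json") as f:
    report = json.load(f)

model_rec = report["recommendations"]["model"]
length_rec = report["recommendations"]["length"]
print(f"Model:  {model_rec['recommended']} ({model_rec['confidence']} confidence)")
print(f"Length: {length_rec['recommended']}")
```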
🧪 Test Cases Overview
The framework uses 8 carefully designed test cases covering:
- Introductory Connections: AI Pervasiveness → Neural Networks
- Technical Depth: Training → Hardware Acceleration
- Advanced Topics: Adversarial Attacks → Privacy
- Practical Applications: Frameworks → Deployment
- Backward References: Optimization → Training Fundamentals
- Complex Technical: Transformers → Efficient Attention
- Real-world Applications: Edge Computing → Deployment
- Short Content: CNN Basics → Image Classification
Each test case includes realistic content excerpts and represents different difficulty levels and domains.
🔬 Methodology
Model Testing Process
- Generate explanations using each available model
- Evaluate with LLM judge (powerful model like qwen2.5:32b)
- Score across 6 criteria (1-10 scale)
- Calculate statistics (mean, median, std dev)
- Rank models by overall performance
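A minimal sketch of this loop, with generate and judge standing in for the actual helpers in experiment_runner.py and llm_judge.py (their real names and signatures may differ):

```python
from statistics import mean, median, stdev

def compare_models(models, test_cases, generate, judge):
    """Score every model on every test case and rank by mean overall score."""
    results = {}
    for model in models:
        case_scores = []
        for case in test_cases:
            explanation = generate(model, case)   # one explanation per test case
            verdict = judge(case, explanation)    # dict of 1-10 scores per criterion
            case_scores.append(mean(verdict.values()))
        results[model] = {
            "mean": mean(case_scores),
            "median": median(case_scores),
            "std_dev": stdev(case_scores) if len(case_scores) > 1 else 0.0,
        }
    # Best-performing model first.
    return sorted(results.items(), key=lambda kv: kv[1]["mean"], reverse=True)
```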
Length Optimization Process
- Use best-performing model from comparison phase
- Test 5 length targets on diverse test cases
- Measure quality vs length trade-offs
- Check adherence to length constraints
- Recommend optimal range
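The adherence check itself is straightforward; a sketch, assuming targets carry the min_words/max_words fields shown under Customization:

```python
def within_target(explanation: str, target: dict) -> bool:
    # True if the explanation's word count falls inside the target range.
    n_words = len(explanation.split())
    return target["min_words"] <= n_words <= target["max_words"]
```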
Evaluation Reliability
- Low temperature (0.1) for consistent judge scoring
- Multiple test cases per condition for statistical validity
- Retry logic for network reliability
- Comprehensive criteria covering all important aspects
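For illustration, a request helper along these lines covers the temperature and retry points. This is a sketch against the standard Ollama REST endpoint, not the actual _make_ollama_request:

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def ollama_generate(model: str, prompt: str, retries: int = 3, timeout: int = 120) -> str:
    """Single non-streaming generation with low temperature and simple retries."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.1},  # keep judge scoring consistent
    }
    for attempt in range(retries):
        try:
            resp = requests.post(OLLAMA_URL, json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying
```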
📈 Expected Outcomes
The experiments will determine:
- Best Model: Which Ollama model generates highest-quality explanations
- Optimal Length: Sweet spot between informativeness and conciseness
- Performance Gaps: How much difference model choice makes
- Length Sensitivity: How explanation length affects quality
- Deployment Recommendations: Data-driven guidance for production
🛠️ Customization
Adding New Models
Edit experiment_runner.py:
self.test_models = [
"qwen2.5:7b",
"your-new-model:version", # Add here
# ... existing models
]
Adding Test Cases
Edit test_cases.py:
TEST_CASES.append({
"id": "your_test_case",
"source_title": "Source Section",
"source_content": "Content...",
"target_title": "Target Section",
"target_content": "Content...",
"connection_type": "Preview",
"domain": "your_domain",
"difficulty": "intermediate"
})
Adjusting Length Targets
Edit test_cases.py:
LENGTH_TARGETS.append({
"min_words": 5,
"max_words": 8,
"description": "custom_length"
})
🚨 Troubleshooting
Common Issues
No models available
ollama list # Check installed models
ollama pull qwen2.5:7b # Install a model
ollama serve # Start Ollama daemon
Import errors
cd scripts/llm_experiments
python -c "import requests; print('✅ OK')"
Slow performance
- Use smaller models for faster testing
- Reduce test cases in run_model_comparison_experiment()
- Increase timeouts in _make_ollama_request()
Debug Mode
For detailed debugging, run individual components:
from experiment_runner import ExperimentRunner
runner = ExperimentRunner()
models = runner.check_available_models()
print(f"Available models: {models}")
📝 Next Steps After Experiments
- Review recommendations in recommendations_latest.json
- Update cross_refs.py with the optimal model
- Adjust the prompt for the optimal explanation length
- Test on real data with a small batch
- Deploy to production if results are satisfactory
The framework provides the data-driven foundation for making informed decisions about LLM model selection and explanation generation parameters.