mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-02 22:51:55 -05:00
Remove backward compatibility aliases and enforce PyTorch-consistent naming:
- Remove Dense = Linear alias in Module 04 (layers)
- Update all Dense references to Linear in Modules 02, 08, 09, 18, 21
- Remove MaxPool2d = MaxPool2D alias in Module 17 (quantization)
- Standardize fc/dense_weights to linear_weights in Module 18 (compression)

Benefits:
- Eliminates naming confusion between Dense/Linear terminology
- Aligns with PyTorch production patterns (nn.Linear)
- Reduces cognitive load with single consistent naming convention
- Improves student transfer to real ML frameworks

All modules tested and functionality preserved.
1008 lines
37 KiB
Python
#| default_exp core.mlops
"""
# MLOps - Production ML Systems Module

This module teaches production ML systems engineering through hands-on implementation
of monitoring, deployment, and lifecycle management tools.

**Learning Focus**: Systems engineering principles for production ML
**Key Concepts**: Model monitoring, drift detection, automated retraining
**Systems Insights**: How real ML systems manage model lifecycles
"""

# %% [markdown]
"""
## 🎯 Learning Objectives

By completing this module, you will:

1. **Build model monitoring systems** that track performance degradation in production
2. **Implement drift detection** to identify when data changes affect model quality
3. **Create automated retraining triggers** that respond to performance issues
4. **Understand MLOps systems engineering** principles used in real production systems

**Systems Engineering Focus**: This module emphasizes building production-ready
infrastructure, not just algorithms. You'll learn memory management, performance
monitoring, and system architecture patterns used in enterprise ML systems.
"""

# %% [markdown]
"""
## 📚 Background: Production ML Systems

### The MLOps Challenge

In production ML systems, models don't just run once - they serve predictions
continuously for months or years. This creates unique engineering challenges:

1. **Model Degradation**: Performance drops as data changes over time
2. **Data Drift**: Input distributions shift, affecting model validity
3. **Infrastructure Complexity**: Monitoring, versioning, and deployment at scale
4. **Incident Response**: Automated detection and response to performance issues

### Real-World Context

Companies like Netflix, Uber, and Airbnb run thousands of ML models in production.
Each model requires:
- Continuous performance monitoring
- Automated drift detection
- Retraining pipelines
- A/B testing infrastructure
- Incident response systems

This module teaches you to build these systems from scratch.
"""

import numpy as np
import os
import sys
import time
import json
from typing import Dict, List, Tuple, Optional, Any, Callable
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from collections import defaultdict

# Import our dependencies - try from package first, then local modules
try:
    from tinytorch.core.tensor import Tensor
    from tinytorch.core.training import Trainer
    from tinytorch.core.layers import Linear
except ImportError:
    # For development, fallback gracefully
    print("⚠️ Some TinyTorch modules not available - MLOps will use mock implementations")
    Tensor = None
    Trainer = None
    Linear = None  # keep every imported name defined on the fallback path

# %% [markdown]
"""
---

## 🔬 SECTION 1: Model Performance Monitoring

### The Foundation: Tracking Model Health

Real production systems continuously monitor model performance. When accuracy drops
or latency increases, automated systems need to detect and respond.

**Systems Insight**: Production monitoring is about **thresholds and trends**, not
perfect accuracy. A 5% accuracy drop might trigger retraining, while gradual
degradation over weeks might indicate data drift.

**What You'll Build**: A `ModelMonitor` class that tracks accuracy and latency over time
and detects performance degradation using simple thresholds.
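
To make "thresholds and trends" concrete, here is a minimal sketch that is separate
from the `ModelMonitor` you will build below. The helper name `classify_health` and
its `drop_fraction` and `window` parameters are assumptions made for illustration;
the 10% drop and the 2% trend tolerance simply mirror the defaults used later in
this module.

```python
from statistics import mean

def classify_health(history, baseline, drop_fraction=0.10, window=3):
    # Hypothetical helper for illustration; not part of ModelMonitor.
    # Threshold check: is the latest value too far below the baseline?
    threshold = baseline * (1.0 - drop_fraction)
    if history[-1] < threshold:
        return "alert"
    # Trend check: compare the latest window with the window before it.
    if len(history) >= 2 * window:
        recent = mean(history[-window:])
        older = mean(history[-2 * window:-window])
        if recent < older * 0.98:
            return "degrading"
    return "healthy"

print(classify_health([0.91, 0.92, 0.91, 0.90, 0.92, 0.91], baseline=0.92))  # healthy
print(classify_health([0.92, 0.92, 0.92, 0.89, 0.88, 0.87], baseline=0.92))  # degrading
print(classify_health([0.92, 0.91, 0.80], baseline=0.92))                    # alert
```

The second series never crosses the alert threshold (0.828), yet the trend check
still flags it. That is the kind of gradual degradation a pure threshold misses.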
"""

class ModelMonitor:
    """
    Monitors ML model performance in production environments.

    Tracks accuracy and latency over time to detect degradation.
    This is the foundation of production ML systems monitoring.
    """

    def __init__(self, model_name: str, baseline_accuracy: float = 0.95):
        """
        Initialize model monitoring with baseline performance expectations.

        Args:
            model_name: Unique identifier for the model being monitored
            baseline_accuracy: Expected accuracy level (alerts when dropping below 90% of this)
        """
        self.model_name = model_name
        self.baseline_accuracy = baseline_accuracy

        # Performance history - stored as lists for simple analysis
        self.accuracy_history: List[float] = []
        self.latency_history: List[float] = []  # milliseconds
        self.timestamp_history: List[datetime] = []

        # Alert thresholds - 90% of baseline triggers accuracy alert
        self.accuracy_threshold = baseline_accuracy * 0.9
        self.latency_threshold = 200.0  # milliseconds

        # Alert states (reflect only the most recent measurement)
        self.accuracy_alert = False
        self.latency_alert = False

        print(f"✅ Model monitor initialized for '{model_name}'")
        print(f"   Baseline accuracy: {baseline_accuracy:.3f}")
        print(f"   Alert threshold: {self.accuracy_threshold:.3f}")

    def record_performance(self, accuracy: float, latency: float) -> None:
        """
        Record a new performance measurement.

        Args:
            accuracy: Model accuracy on recent batch
            latency: Inference latency in milliseconds
        """
        self.accuracy_history.append(accuracy)
        self.latency_history.append(latency)
        self.timestamp_history.append(datetime.now())

        # Update alert states
        self.accuracy_alert = accuracy < self.accuracy_threshold
        self.latency_alert = latency > self.latency_threshold

        # Log significant changes
        if self.accuracy_alert:
            print(f"🚨 ACCURACY ALERT: {accuracy:.3f} < {self.accuracy_threshold:.3f}")
        if self.latency_alert:
            print(f"🚨 LATENCY ALERT: {latency:.1f}ms > {self.latency_threshold:.1f}ms")

    def check_alerts(self) -> Dict[str, Any]:
        """
        Check current alert status and return summary.

        Returns:
            Dictionary with alert information and recent performance
        """
        if not self.accuracy_history:
            return {"status": "no_data", "alerts": []}

        recent_accuracy = self.accuracy_history[-1]
        recent_latency = self.latency_history[-1]

        alerts = []
        if self.accuracy_alert:
            alerts.append({
                "type": "accuracy_degradation",
                "current": recent_accuracy,
                "threshold": self.accuracy_threshold,
                "severity": "high"
            })

        if self.latency_alert:
            alerts.append({
                "type": "latency_spike",
                "current": recent_latency,
                "threshold": self.latency_threshold,
                "severity": "medium"
            })

        return {
            "model_name": self.model_name,
            "status": "alert" if alerts else "healthy",
            "alerts": alerts,
            "recent_accuracy": recent_accuracy,
            "recent_latency": recent_latency,
            "total_measurements": len(self.accuracy_history)
        }

    def get_performance_trend(self, window: int = 5) -> Dict[str, Any]:
        """
        Analyze performance trends over recent measurements.

        Args:
            window: Number of recent measurements to analyze

        Returns:
            Trend labels plus the window averages for accuracy and latency
        """
        if len(self.accuracy_history) < window:
            return {"accuracy_trend": "insufficient_data", "latency_trend": "insufficient_data"}

        # Compare recent window to previous window.
        # With fewer than 2*window measurements there is no previous window,
        # so the comparison falls back to the recent value and reports "stable".
        recent_acc = np.mean(self.accuracy_history[-window:])
        older_acc = np.mean(self.accuracy_history[-2*window:-window]) if len(self.accuracy_history) >= 2*window else recent_acc

        recent_lat = np.mean(self.latency_history[-window:])
        older_lat = np.mean(self.latency_history[-2*window:-window]) if len(self.latency_history) >= 2*window else recent_lat

        # Simple trend analysis: 2% tolerance for accuracy, 10% for latency
        accuracy_trend = "stable"
        if recent_acc > older_acc * 1.02:
            accuracy_trend = "improving"
        elif recent_acc < older_acc * 0.98:
            accuracy_trend = "degrading"

        latency_trend = "stable"
        if recent_lat > older_lat * 1.1:
            latency_trend = "increasing"
        elif recent_lat < older_lat * 0.9:
            latency_trend = "decreasing"

        return {
            "accuracy_trend": accuracy_trend,
            "latency_trend": latency_trend,
            "recent_accuracy": recent_acc,
            "older_accuracy": older_acc,
            "recent_latency": recent_lat,
            "older_latency": older_lat
        }

# %% [markdown]
"""
### 🧪 Testing: Model Monitor

Let's test our model monitoring system with realistic scenarios.
"""

def test_model_monitor():
    """Test the ModelMonitor with realistic performance scenarios."""
    print("🧪 Testing Model Monitor...")

    # Create monitor for an image classifier
    monitor = ModelMonitor("image_classifier_v1", baseline_accuracy=0.92)

    # Simulate healthy performance
    print("\n📊 Phase 1: Healthy Performance")
    for i in range(5):
        accuracy = 0.91 + np.random.normal(0, 0.01)  # Around 91% with small variance
        latency = 150 + np.random.normal(0, 10)  # Around 150ms
        monitor.record_performance(accuracy, latency)

    alerts = monitor.check_alerts()
    print(f"Status: {alerts['status']}")
    print(f"Alerts: {len(alerts['alerts'])}")

    # Simulate performance degradation
    print("\n📉 Phase 2: Performance Degradation")
    for i in range(3):
        accuracy = 0.81 + np.random.normal(0, 0.02)  # Around 81%, usually below the 0.828 alert threshold
        latency = 220 + np.random.normal(0, 15)  # Around 220ms, usually above the 200ms threshold
        monitor.record_performance(accuracy, latency)

    # Alert state reflects only the most recent measurement
    alerts = monitor.check_alerts()
    print(f"Status: {alerts['status']}")
    print(f"Number of alerts: {len(alerts['alerts'])}")
    for alert in alerts['alerts']:
        print(f"  - {alert['type']}: {alert['current']:.3f} (threshold: {alert['threshold']:.3f})")

    # Analyze trends
    trend = monitor.get_performance_trend()
    print("\n📈 Trend Analysis:")
    print(f"Accuracy trend: {trend['accuracy_trend']}")
    print(f"Latency trend: {trend['latency_trend']}")

    print("✅ Model monitor test completed!")

# %% [markdown]
"""
---

## 🔬 SECTION 2: Data Drift Detection

### The Challenge: When Data Changes

Data drift occurs when the input distribution changes over time. A model trained
on summer data might perform poorly in winter. Production systems need to detect
this automatically.

**Systems Insight**: Drift detection is about **statistical comparisons**, not
model accuracy. We compare new data statistics to baseline statistics to detect
distribution shifts before they affect model performance.

**What You'll Build**: A `DriftDetector` class using simple statistical thresholds
(mean differences) rather than complex statistical tests.
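
As a minimal sketch of the mean-difference idea, the snippet below scores a single
feature by how far the new mean sits from the baseline mean, measured in baseline
standard deviations. The function name `mean_shift_score` and the synthetic data are
assumptions made for illustration; the `DriftDetector` below applies the same idea
per feature.

```python
import numpy as np

def mean_shift_score(baseline, new):
    # Hypothetical helper: |new mean - baseline mean| in units of baseline std.
    # The small epsilon guards against a zero standard deviation.
    return abs(new.mean() - baseline.mean()) / (baseline.std() + 1e-8)

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=1000)
similar = rng.normal(loc=0.1, scale=1.0, size=500)   # nearly the same distribution
shifted = rng.normal(loc=3.0, scale=1.0, size=500)   # mean moved by about 3 std

threshold = 2.0
print("similar drifted:", mean_shift_score(baseline, similar) > threshold)
print("shifted drifted:", mean_shift_score(baseline, shifted) > threshold)
```

Note that the score divides by the per-sample standard deviation, not the standard
error of the mean, so it only reacts to large shifts. That keeps false alarms rare
at the cost of missing small but consistent drifts.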
"""

class DriftDetector:
    """
    Detects distribution drift in input data using statistical methods.

    Compares new data statistics against baseline to identify significant changes
    that might affect model performance.
    """

    def __init__(self, baseline_data: np.ndarray, feature_names: Optional[List[str]] = None):
        """
        Initialize drift detection with baseline data statistics.

        Args:
            baseline_data: Reference data to compare against (n_samples x n_features)
            feature_names: Optional names for features (for reporting)
        """
        self.baseline_data = baseline_data
        self.feature_names = feature_names or [f"feature_{i}" for i in range(baseline_data.shape[1])]

        # Compute baseline statistics
        self.baseline_mean = np.mean(baseline_data, axis=0)
        self.baseline_std = np.std(baseline_data, axis=0)
        self.baseline_min = np.min(baseline_data, axis=0)
        self.baseline_max = np.max(baseline_data, axis=0)

        # Drift history
        self.drift_history: List[Dict] = []

        print("✅ Drift detector initialized")
        print(f"   Baseline shape: {baseline_data.shape}")
        print(f"   Features: {len(self.feature_names)}")
        print(f"   Baseline mean: [{', '.join([f'{m:.3f}' for m in self.baseline_mean[:3]])}...]")

    def detect_simple_drift(self, new_data: np.ndarray, threshold: float = 2.0) -> Dict[str, Any]:
        """
        Simple drift detection using statistical thresholds.

        Args:
            new_data: New data to compare against baseline
            threshold: Number of baseline standard deviations the mean must shift to count as drift

        Returns:
            Drift detection results
        """
        if new_data.shape[1] != self.baseline_data.shape[1]:
            raise ValueError(f"Feature count mismatch: {new_data.shape[1]} vs {self.baseline_data.shape[1]}")

        # Compute new data statistics
        new_mean = np.mean(new_data, axis=0)

        # Detect drift per feature using simple threshold
        drift_flags = []
        drift_scores = []

        for i in range(len(self.baseline_mean)):
            # Mean shift detection
            mean_diff = abs(new_mean[i] - self.baseline_mean[i])

            # Normalize drift score to handle different feature scales
            # Small epsilon (1e-8) prevents division by zero
            drift_score = mean_diff / (self.baseline_std[i] + 1e-8)
            drift_flags.append(drift_score > threshold)
            drift_scores.append(drift_score)

        # Overall drift assessment
        drift_detected = any(drift_flags)
        max_drift_score = max(drift_scores)

        # Simple severity classification with fixed cut-offs.
        # These cut-offs are independent of the `threshold` argument above.
        if max_drift_score > 3.0:
            drift_severity = "high"
        elif max_drift_score > 2.0:
            drift_severity = "medium"
        else:
            drift_severity = "low"

        result = {
            "timestamp": datetime.now(),
            "drift_detected": drift_detected,
            "drift_severity": drift_severity,
            "max_drift_score": max_drift_score,
            "per_feature": {
                self.feature_names[i]: {
                    "drift_detected": drift_flags[i],
                    "drift_score": drift_scores[i],
                    "baseline_mean": self.baseline_mean[i],
                    "new_mean": new_mean[i]
                }
                for i in range(len(self.feature_names))
            },
            "summary": {
                "total_features": len(self.feature_names),
                "drifted_features": sum(drift_flags),
                "drift_percentage": sum(drift_flags) / len(drift_flags) * 100
            }
        }

        # Store in history
        self.drift_history.append(result)

        if drift_detected:
            print(f"🚨 DRIFT DETECTED: {sum(drift_flags)}/{len(drift_flags)} features drifted")
            print(f"   Severity: {drift_severity} (max score: {max_drift_score:.2f})")

        return result

    def get_drift_history(self, limit: int = 10) -> List[Dict]:
        """Get recent drift detection history (pass limit=0 for the full history)."""
        return self.drift_history[-limit:] if limit else self.drift_history

# %% [markdown]
"""
### 🧪 Testing: Drift Detection

Let's test drift detection with simulated data distribution changes.
"""

def test_drift_detection():
    """Test drift detection with controlled data changes."""
    print("🧪 Testing Drift Detection...")

    # Create baseline data - normal distribution
    np.random.seed(42)
    baseline_data = np.random.normal(loc=[0, 1, -0.5], scale=[1, 0.5, 0.8], size=(1000, 3))
    feature_names = ["temperature", "humidity", "pressure"]

    detector = DriftDetector(baseline_data, feature_names)

    # Test 1: No drift - similar distribution
    print("\n📊 Test 1: No Drift Expected")
    new_data_normal = np.random.normal(loc=[0.1, 1.1, -0.4], scale=[1, 0.5, 0.8], size=(500, 3))
    result1 = detector.detect_simple_drift(new_data_normal)
    print(f"Drift detected: {result1['drift_detected']}")
    print(f"Drifted features: {result1['summary']['drifted_features']}/{result1['summary']['total_features']}")

    # Test 2: Clear drift - shifted distribution
    print("\n🚨 Test 2: Clear Drift Expected")
    new_data_drift = np.random.normal(loc=[3, 1, -0.5], scale=[1, 0.5, 0.8], size=(500, 3))  # Temperature shifted
    result2 = detector.detect_simple_drift(new_data_drift)
    print(f"Drift detected: {result2['drift_detected']}")
    print(f"Severity: {result2['drift_severity']}")
    print(f"Max drift score: {result2['max_drift_score']:.2f}")

    # Show per-feature results
    for feature, info in result2['per_feature'].items():
        if info['drift_detected']:
            print(f"  - {feature}: {info['baseline_mean']:.2f} → {info['new_mean']:.2f} (score: {info['drift_score']:.2f})")

    print("✅ Drift detection test completed!")

# %% [markdown]
"""
---

## 🔬 SECTION 3: Automated Retraining Triggers

### The Response: When to Retrain

When performance drops or drift is detected, production systems need to decide
whether to retrain the model. This requires balancing computational cost with
model quality.

**Systems Insight**: Retraining triggers are about **policies and thresholds**.
Real systems consider accuracy drops, drift severity, data volume, and business
impact when deciding to retrain.

**What You'll Build**: A `RetrainingTrigger` class that makes retraining decisions
based on configurable policies combining performance and drift signals.
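
The core of such a policy can be sketched in a few lines: any signal may request a
retrain, but a cooldown can veto it. The function `decide_retrain` and its parameter
names are assumptions made for illustration; the default values echo the default
policy of the `RetrainingTrigger` below, which adds time-based triggers and a
severity score on top.

```python
def decide_retrain(accuracy_drop, drift_score, seconds_since_retrain,
                   max_drop=0.05, max_drift=2.0, cooldown=24 * 3600):
    # Hypothetical helper for illustration; not the RetrainingTrigger API.
    # Either signal is enough to request a retrain...
    wants_retrain = accuracy_drop >= max_drop or drift_score >= max_drift
    # ...but the cooldown prevents retraining too often.
    cooled_down = seconds_since_retrain >= cooldown
    return wants_retrain and cooled_down

print(decide_retrain(0.08, 0.5, 48 * 3600))  # True: large drop, cooldown elapsed
print(decide_retrain(0.08, 0.5, 3600))       # False: still inside the cooldown
print(decide_retrain(0.01, 3.1, 48 * 3600))  # True: drift alone is enough
print(decide_retrain(0.01, 0.5, 48 * 3600))  # False: no signal fired
```

Separating "wants a retrain" from "is allowed to retrain" keeps the policy easy to
reason about: the first part is about model quality, the second about cost.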
"""

class RetrainingTrigger:
    """
    Manages automated retraining decisions based on performance and drift signals.

    Implements policies for when to trigger expensive retraining operations.
    """

    def __init__(self, model_name: str, retraining_policy: Optional[Dict] = None):
        """
        Initialize retraining trigger with policies.

        Args:
            model_name: Model being managed
            retraining_policy: Configuration for retraining decisions
        """
        self.model_name = model_name

        # Default retraining policy
        default_policy = {
            "accuracy_threshold": 0.05,  # Accuracy 0.05 or more below the alert threshold triggers retraining
            "drift_threshold": 2.0,  # Drift score >= 2.0 triggers retraining
            "min_time_between_retrains": 24 * 3600,  # 24 hours minimum
            "max_time_without_retrain": 7 * 24 * 3600,  # 7 days maximum
            "drift_severity_weights": {"low": 0.1, "medium": 0.5, "high": 1.0}  # Simplified weights
        }

        self.policy = {**default_policy, **(retraining_policy or {})}

        # Retraining history
        self.retraining_history: List[Dict] = []
        # The cooldown clock starts at construction time
        self.last_retrain_time = datetime.now()

        print(f"✅ Retraining trigger initialized for '{model_name}'")
        print(f"   Accuracy threshold: {self.policy['accuracy_threshold']} drop")
        print(f"   Drift threshold: {self.policy['drift_threshold']}")

    def should_retrain(self, monitor_alerts: Dict, drift_result: Dict) -> Dict[str, Any]:
        """
        Decide whether to trigger retraining based on current conditions.

        Args:
            monitor_alerts: Results from ModelMonitor.check_alerts()
            drift_result: Results from DriftDetector.detect_simple_drift()

        Returns:
            Retraining decision with reasoning
        """
        current_time = datetime.now()
        time_since_last_retrain = (current_time - self.last_retrain_time).total_seconds()

        # Initialize decision tracking
        trigger_reasons = []
        severity_score = 0.0

        # Check accuracy degradation
        accuracy_trigger = False
        if monitor_alerts['status'] == 'alert':
            for alert in monitor_alerts['alerts']:
                if alert['type'] == 'accuracy_degradation':
                    # The drop is measured below the alert threshold, not below the baseline
                    accuracy_drop = alert['threshold'] - alert['current']
                    if accuracy_drop >= self.policy['accuracy_threshold']:
                        accuracy_trigger = True
                        trigger_reasons.append(f"Accuracy dropped by {accuracy_drop:.3f}")
                        severity_score += 1.0

        # Check drift conditions
        drift_trigger = False
        if drift_result['drift_detected']:
            drift_weight = self.policy['drift_severity_weights'][drift_result['drift_severity']]
            if drift_result['max_drift_score'] >= self.policy['drift_threshold']:
                drift_trigger = True
                trigger_reasons.append(f"Drift detected (score: {drift_result['max_drift_score']:.2f})")
                severity_score += drift_weight

        # Check time-based policies
        time_trigger = False
        if time_since_last_retrain >= self.policy['max_time_without_retrain']:
            time_trigger = True
            trigger_reasons.append(f"Maximum time exceeded ({time_since_last_retrain/86400:.1f} days)")
            severity_score += 0.5

        # Check minimum time constraint (cooldown)
        min_time_satisfied = time_since_last_retrain >= self.policy['min_time_between_retrains']

        # Final decision: any trigger may fire, but only once the cooldown has elapsed
        should_retrain = (accuracy_trigger or drift_trigger or time_trigger) and min_time_satisfied

        if not min_time_satisfied and (accuracy_trigger or drift_trigger):
            trigger_reasons.append(f"Waiting for minimum time ({self.policy['min_time_between_retrains']/3600:.1f}h)")

        decision = {
            "timestamp": current_time,
            "should_retrain": should_retrain,
            "severity_score": severity_score,
            "trigger_reasons": trigger_reasons,
            "constraints": {
                "accuracy_trigger": accuracy_trigger,
                "drift_trigger": drift_trigger,
                "time_trigger": time_trigger,
                "min_time_satisfied": min_time_satisfied
            },
            "time_since_last_retrain": time_since_last_retrain,
            "next_allowed_retrain": self.last_retrain_time + timedelta(seconds=self.policy['min_time_between_retrains'])
        }

        if should_retrain:
            print(f"🔄 RETRAINING TRIGGERED for {self.model_name}")
            print(f"   Reasons: {', '.join(trigger_reasons)}")
            print(f"   Severity score: {severity_score:.2f}")
            self.last_retrain_time = current_time
            self.retraining_history.append(decision)

        return decision

    def get_retraining_history(self, limit: int = 10) -> List[Dict]:
        """Get recent retraining decision history (pass limit=0 for the full history)."""
        return self.retraining_history[-limit:] if limit else self.retraining_history

# %% [markdown]
"""
### 🧪 Testing: Retraining Triggers

Let's test the retraining decision logic with various scenarios.
"""

def test_retraining_triggers():
    """Test retraining trigger logic with different scenarios."""
    print("🧪 Testing Retraining Triggers...")

    # Create retraining trigger
    trigger = RetrainingTrigger("test_model", {
        "min_time_between_retrains": 0,  # no cooldown, so the scenarios below can trigger immediately
        "accuracy_threshold": 0.05
    })

    # Scenario 1: No issues - no retrain
    print("\n📊 Scenario 1: Healthy Model")
    healthy_alerts = {"status": "healthy", "alerts": []}
    no_drift = {"drift_detected": False, "drift_severity": "low", "max_drift_score": 0.5}

    decision1 = trigger.should_retrain(healthy_alerts, no_drift)
    print(f"Should retrain: {decision1['should_retrain']}")
    print(f"Reasons: {decision1['trigger_reasons']}")

    # Scenario 2: Accuracy drop - should trigger
    print("\n🚨 Scenario 2: Accuracy Degradation")
    accuracy_alerts = {
        "status": "alert",
        "alerts": [{
            "type": "accuracy_degradation",
            "current": 0.85,
            "threshold": 0.90,
            "severity": "high"
        }]
    }

    decision2 = trigger.should_retrain(accuracy_alerts, no_drift)
    print(f"Should retrain: {decision2['should_retrain']}")
    print(f"Reasons: {decision2['trigger_reasons']}")
    print(f"Severity score: {decision2['severity_score']}")

    # Short pause so the two triggered decisions get clearly distinct timestamps
    time.sleep(1)

    # Scenario 3: High drift - should trigger
    print("\n🚨 Scenario 3: High Drift")
    high_drift = {"drift_detected": True, "drift_severity": "high", "max_drift_score": 3.5}

    decision3 = trigger.should_retrain(healthy_alerts, high_drift)
    print(f"Should retrain: {decision3['should_retrain']}")
    print(f"Reasons: {decision3['trigger_reasons']}")

    print("✅ Retraining triggers test completed!")

# %% [markdown]
"""
---

## 🔬 SECTION 4: Systems Analysis **[ADVANCED - OPTIONAL]**

> **Note**: This section contains advanced systems analysis. Focus on Sections 1-3
> for core MLOps concepts. This section shows how to analyze production performance.

### Memory Analysis: Monitoring Infrastructure Costs

Let's analyze the memory usage patterns of our MLOps infrastructure and understand
the computational costs of production monitoring.

**Advanced Topic**: Production MLOps systems need to monitor their own performance
to ensure monitoring doesn't become more expensive than the models being monitored.
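
One common way to keep monitoring memory bounded is a fixed-size circular buffer.
The sketch below uses `collections.deque` with `maxlen` from the Python standard
library. The window size of 3 is an arbitrary value chosen for illustration, and the
`ModelMonitor` in this module uses unbounded lists instead.

```python
from collections import deque

# A deque with maxlen silently drops the oldest item once it is full.
history = deque(maxlen=3)
for accuracy in [0.91, 0.90, 0.89, 0.88, 0.87]:
    history.append(accuracy)

print(list(history))  # [0.89, 0.88, 0.87]
print(len(history))   # 3
```

With a bounded buffer, memory per monitor stays constant no matter how long the
model has been in production. Older measurements would have to be written to
external storage if they are still needed.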
"""

def analyze_mlops_memory_usage():
    """Analyze memory consumption patterns in MLOps components."""
    print("🔬 MLOps Memory Analysis")
    print("=" * 50)

    import tracemalloc
    tracemalloc.start()

    # Test different monitoring scales
    model_counts = [1, 10, 100]
    history_lengths = [100, 1000, 10000]

    for model_count in model_counts:
        for history_length in history_lengths:
            tracemalloc.clear_traces()

            # Create monitoring infrastructure
            monitors = []
            for i in range(model_count):
                monitor = ModelMonitor(f"model_{i}", baseline_accuracy=0.9)

                # Fill with history
                for j in range(history_length):
                    monitor.record_performance(
                        accuracy=0.85 + np.random.normal(0, 0.02),
                        latency=150 + np.random.normal(0, 10)
                    )
                monitors.append(monitor)

            # Measure memory
            current, peak = tracemalloc.get_traced_memory()

            # Calculate per-model overhead
            per_model_kb = (current / 1024) / model_count
            per_measurement_bytes = current / (model_count * history_length)

            print(f"Models: {model_count:3d}, History: {history_length:5d}")
            print(f"  Total memory: {current/1024/1024:.2f} MB")
            print(f"  Per model: {per_model_kb:.1f} KB")
            print(f"  Per measurement: {per_measurement_bytes:.1f} bytes")
            print()

    tracemalloc.stop()

    print("💡 Memory Insights:")
    print("- Per-monitor memory is dominated by the three history lists (see the measured figures above)")
    print("- Memory scales linearly with history length")
    print("- Real systems use circular buffers to limit memory growth")
    print("- Database storage is preferred for long-term history")

def analyze_mlops_computational_complexity():
    """Analyze computational complexity of MLOps operations."""
    print("🔬 MLOps Computational Complexity")
    print("=" * 50)

    # Test drift detection complexity
    feature_counts = [10, 100, 1000]
    sample_counts = [100, 1000, 10000]

    print("Drift Detection Performance:")
    for n_features in feature_counts:
        for n_samples in sample_counts:
            # Create test data (the new batch is half the size of the baseline)
            n_new = n_samples // 2
            baseline = np.random.normal(0, 1, (n_samples, n_features))
            new_data = np.random.normal(0.1, 1, (n_new, n_features))

            detector = DriftDetector(baseline)

            # Time only the detection call, using a monotonic high-resolution clock
            start_time = time.perf_counter()
            result = detector.detect_simple_drift(new_data)
            end_time = time.perf_counter()

            computation_time = (end_time - start_time) * 1000  # milliseconds

            print(f"  Features: {n_features:4d}, Baseline samples: {n_samples:5d}, New samples: {n_new:5d}")
            print(f"    Time: {computation_time:.2f}ms")
            # Normalize by the size of the batch that was actually scanned
            print(f"    Normalized: {computation_time/(n_features*n_new)*1e6:.2f} ns per feature-sample")

    print("\n💡 Complexity Insights:")
    print("- Drift detection is O(N*M) where N=samples, M=features")
    print("- Real systems use sampling and approximation for large datasets")
    print("- Statistical tests (KS, Mann-Whitney) are more expensive but more robust")
    print("- Production systems often batch process drift detection")

# %% [markdown]
"""
## 🧪 Comprehensive Testing

Let's test the complete MLOps system integration.
"""

def run_comprehensive_mlops_tests():
    """Run all MLOps component tests."""
    print("🧪 Running Comprehensive MLOps Tests")
    print("=" * 50)

    try:
        # Test each component
        test_model_monitor()
        print()
        test_drift_detection()
        print()
        test_retraining_triggers()
        print()

        # Test integration
        print("🔄 Testing Integration...")

        # Create integrated system
        monitor = ModelMonitor("production_model", baseline_accuracy=0.92)

        # Generate baseline data for drift detection
        np.random.seed(42)
        baseline_data = np.random.normal(loc=[0, 1], scale=[1, 0.5], size=(1000, 2))
        detector = DriftDetector(baseline_data, ["feature_a", "feature_b"])

        # The simulated "days" run within milliseconds, so the default 24-hour
        # cooldown would block every retrain. Disable it for this simulation.
        trigger = RetrainingTrigger("production_model", {"min_time_between_retrains": 0})

        # Simulate production scenario
        print("\n📊 Production Simulation:")

        # Day 1-3: Normal operation
        print("Days 1-3: Normal operation")
        for day in range(3):
            accuracy = 0.91 + np.random.normal(0, 0.01)
            latency = 150 + np.random.normal(0, 10)
            monitor.record_performance(accuracy, latency)

            new_data = np.random.normal(loc=[0.1, 1.1], scale=[1, 0.5], size=(100, 2))
            drift_result = detector.detect_simple_drift(new_data)

            alerts = monitor.check_alerts()
            decision = trigger.should_retrain(alerts, drift_result)

            print(f"  Day {day+1}: Accuracy={accuracy:.3f}, Drift={drift_result['drift_detected']}, Retrain={decision['should_retrain']}")

        # Day 4: Data shift occurs
        print("\nDay 4: Data distribution shift")
        accuracy = 0.87  # Noticeable drop, but still above the 0.828 alert threshold (0.9 * 0.92)
        latency = 180
        monitor.record_performance(accuracy, latency)

        # Shifted data
        new_data = np.random.normal(loc=[2, 1], scale=[1, 0.5], size=(100, 2))  # Feature A shifted by about 2 std
        drift_result = detector.detect_simple_drift(new_data)

        alerts = monitor.check_alerts()
        decision = trigger.should_retrain(alerts, drift_result)

        print(f"  Day 4: Accuracy={accuracy:.3f}, Drift={drift_result['drift_detected']}, Retrain={decision['should_retrain']}")
        if decision['should_retrain']:
            print(f"  Trigger reasons: {', '.join(decision['trigger_reasons'])}")

        print("\n✅ Comprehensive MLOps integration test completed!")

    except Exception as e:
        print(f"❌ Test failed: {e}")
        raise

# %% [markdown]
"""
---

## 📊 Production Context: Real-World MLOps Systems **[OPTIONAL]**

> **Note**: This section provides real-world context but isn't required for
> understanding the core implementations above.

### How Real Systems Handle MLOps

**Netflix**: Runs 1000+ ML models in production with automated monitoring
- Uses Kafka for real-time metric streaming
- A/B tests model versions automatically
- Monitors business metrics (click-through rate) alongside model metrics

**Uber**: MLOps platform serves billions of predictions daily
- Custom monitoring with drift detection for rider demand models
- Automated retraining triggers based on city-specific performance
- Feature stores with automated data quality checks

**Airbnb**: ML models for pricing, search ranking, and fraud detection
- Custom dashboard showing model health across all services
- Automated canary deployments with rollback capabilities
- Real-time alert system integrated with incident response

### Systems Engineering Insights

1. **Scale**: Real systems monitor thousands of models simultaneously
2. **Automation**: Human intervention is expensive - everything must be automated
3. **Business Impact**: Technical metrics (accuracy) must connect to business metrics (revenue)
4. **Reliability**: MLOps infrastructure must be more reliable than the models it monitors
5. **Cost**: Monitoring costs must be balanced against model improvement benefits
"""

# %% [markdown]
"""
## 🧪 Main Execution

Run all tests when this module is executed directly.
"""

if __name__ == "__main__":
    print("🚀 TinyTorch MLOps Module")
    print("=" * 50)

    try:
        # Run all tests
        run_comprehensive_mlops_tests()
        print()

        # Run performance analysis
        analyze_mlops_memory_usage()
        print()
        analyze_mlops_computational_complexity()

        print("🎉 All MLOps tests completed successfully!")

    except Exception as e:
        print(f"❌ MLOps module execution failed: {e}")
        import traceback
        traceback.print_exc()

# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions

These questions help you think about the systems engineering principles behind
MLOps infrastructure and connect to real production challenges.
"""

# %% [markdown]
"""
**Question 1: Memory Management in Production Monitoring**

You're monitoring 500 models in production, each recording performance metrics
every minute. Each model maintains 1 week of history.

Calculate:
- Total memory usage for metric storage
- Memory growth rate per day
- When you need to implement data retention policies

*Consider: How would you design a memory-efficient monitoring system that
doesn't lose critical historical information?*
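
One way to set up the estimate is sketched below. The question does not give a
record size, so `BYTES_PER_RECORD = 64` is an assumption made only for
illustration (roughly a timestamp plus a few floats in a packed layout); plug
in your own figure.

```python
# Back-of-envelope estimate for Question 1.
# BYTES_PER_RECORD is an assumption; the question does not state a record size.
NUM_MODELS = 500
RECORDS_PER_DAY = 24 * 60          # one record per model per minute
RETENTION_DAYS = 7                 # one week of history
BYTES_PER_RECORD = 64              # hypothetical packed record size

records_per_model = RECORDS_PER_DAY * RETENTION_DAYS
total_records = NUM_MODELS * records_per_model
total_bytes = total_records * BYTES_PER_RECORD
daily_growth_bytes = NUM_MODELS * RECORDS_PER_DAY * BYTES_PER_RECORD

MIB = 1024 * 1024
print(f"Records per model per week: {records_per_model:,}")
print(f"Total records held:         {total_records:,}")
print(f"Memory for one week:        {total_bytes / MIB:.1f} MiB")
print(f"Growth per day:             {daily_growth_bytes / MIB:.1f} MiB")
```

Under these assumptions the system holds about 308 MiB for one week of history
and adds about 44 MiB per day. Plain Python objects (for example a list of
dicts) typically cost several times more per record, so the retention question
becomes pressing sooner than the packed estimate suggests.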
"""

# %% [markdown]
"""
**Question 2: Drift Detection Trade-offs**

Your drift detector takes 50ms to analyze 1000 features. You need to monitor
100 models, each with 5000 features, every hour.

Analyze:
- Total computational cost per monitoring cycle
- Whether you can meet the hourly requirement
- How to optimize without losing detection quality

*Consider: What approximations might real systems use to handle larger scale?*
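
A rough starting point for the analysis, assuming the detector's cost scales
linearly with the number of features and that all models are checked one after
another on a single worker. Neither assumption is stated in the question, so
treat the result as a baseline rather than an answer.

```python
# Back-of-envelope estimate for Question 2.
# Assumes linear scaling in feature count and a single sequential worker.
MS_PER_1000_FEATURES = 50
NUM_MODELS = 100
FEATURES_PER_MODEL = 5000
CYCLE_BUDGET_S = 3600              # one monitoring cycle per hour

ms_per_model = MS_PER_1000_FEATURES * FEATURES_PER_MODEL / 1000
total_seconds = ms_per_model * NUM_MODELS / 1000
budget_fraction = total_seconds / CYCLE_BUDGET_S

print(f"Time per model:    {ms_per_model:.0f} ms")
print(f"Time per cycle:    {total_seconds:.0f} s")
print(f"Share of the hour: {budget_fraction:.2%}")
```

With linear scaling, one cycle takes about 25 seconds, well under 1% of the
hourly budget. The picture changes if the detector compares features pairwise
or scans a growing reference window, because the cost is then no longer
proportional to the feature count.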
"""

# %% [markdown]
"""
**Question 3: Retraining Decision Economics**

Model retraining costs $100 in compute and takes 2 hours. A 5% accuracy drop
costs $50/hour in lost business value.

Design a retraining policy that:
- Minimizes total cost (compute + business impact)
- Accounts for retraining reliability (90% success rate)
- Handles different model criticalities

*Consider: How do real companies balance automation vs human oversight for
expensive retraining decisions?*
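
The sketch below shows one possible way to frame the cost comparison. It
assumes failed retraining attempts are retried until one succeeds, attempts are
independent, and the business loss stays constant while the model is degraded.
The `criticality` multiplier is a hypothetical knob for scaling the hourly
loss per model; it is not part of the question.

```python
# Expected-cost sketch for Question 3 (assumptions listed in the text above).
RETRAIN_COST_USD = 100.0
RETRAIN_HOURS = 2.0
LOSS_USD_PER_HOUR = 50.0
SUCCESS_RATE = 0.9


def expected_retrain_cost(criticality=1.0):
    # Mean number of attempts until the first success (geometric distribution)
    expected_attempts = 1.0 / SUCCESS_RATE
    compute = expected_attempts * RETRAIN_COST_USD
    # The degraded model keeps serving while retraining runs
    hourly_loss = LOSS_USD_PER_HOUR * criticality
    loss_while_retraining = expected_attempts * RETRAIN_HOURS * hourly_loss
    return compute + loss_while_retraining


def break_even_hours(criticality=1.0):
    # Hours of continued degradation that cost as much as retraining
    hourly_loss = LOSS_USD_PER_HOUR * criticality
    return expected_retrain_cost(criticality) / hourly_loss


for c in (0.5, 1.0, 4.0):
    cost = expected_retrain_cost(c)
    hours = break_even_hours(c)
    print(f"criticality={c}: expected cost ${cost:.2f}, break-even {hours:.2f} h")
```

With the default criticality, retraining is expected to cost about $222 in
total, so it pays off once the degradation would otherwise persist for more
than roughly 4.4 hours. Higher criticality shortens the break-even time, which
is one argument for per-model thresholds instead of a single global rule.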
"""

# %% [markdown]
"""
**Question 4: Systems Integration Challenges**

Your MLOps system needs to integrate with:
- 5 different model serving systems
- 3 data pipelines with different schemas
- Legacy monitoring infrastructure
- Multiple ML frameworks (PyTorch, TensorFlow, XGBoost)

Evaluate:
- How to design interfaces that work across all systems
- Where to place monitoring hooks in the pipeline
- How to handle partial failures and degraded monitoring

*Consider: What abstraction layers help manage this complexity while keeping
the monitoring reliable?*
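
As a starting point for discussion, here is a minimal sketch of one possible
abstraction: a small adapter interface per backend, plus a collector that
keeps going when one backend fails. Every name here (`MetricSource`,
`StaticSource`, `BrokenSource`, `collect_all`) is hypothetical and not part of
the module you built.

```python
from typing import Dict, List, Protocol, Tuple


class MetricSource(Protocol):
    # Hypothetical interface: each serving system gets a thin adapter
    name: str

    def fetch_metrics(self) -> Dict[str, float]: ...


class StaticSource:
    # Stand-in adapter that returns fixed metrics
    def __init__(self, name, metrics):
        self.name = name
        self._metrics = metrics

    def fetch_metrics(self):
        return dict(self._metrics)


class BrokenSource:
    # Stand-in for a backend that is currently unreachable
    name = "legacy"

    def fetch_metrics(self):
        raise ConnectionError("legacy monitoring endpoint unreachable")


def collect_all(
    sources: List[MetricSource],
) -> Tuple[Dict[str, Dict[str, float]], List[str]]:
    # Collect from every source; record failures instead of aborting
    healthy, failed = {}, []
    for source in sources:
        try:
            healthy[source.name] = source.fetch_metrics()
        except Exception:
            failed.append(source.name)
    return healthy, failed


healthy, failed = collect_all(
    [StaticSource("pytorch-serving", {"accuracy": 0.93}), BrokenSource()]
)
print("healthy:", healthy)
print("degraded:", failed)
```

Returning the failed sources alongside the healthy metrics lets the caller
report degraded coverage explicitly, instead of silently treating missing data
as good news.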
"""

# %% [markdown]
"""
## 🎯 MODULE SUMMARY: MLOps - Production ML Systems

### What You Built

In this module, you implemented a complete MLOps monitoring system from scratch:

1. **ModelMonitor**: Tracks model performance and detects degradation
2. **DriftDetector**: Identifies data distribution changes using statistical methods
3. **RetrainingTrigger**: Makes automated decisions about when to retrain models
4. **Integrated System**: All components working together for production monitoring

### Key Systems Engineering Insights

**Memory Management**: MLOps systems accumulate large amounts of monitoring data.
You learned how to analyze memory usage patterns and design for scale.

**Computational Complexity**: Drift detection and monitoring can be expensive at scale.
You analyzed algorithmic complexity and identified optimization opportunities.

**Production Trade-offs**: Real systems balance detection sensitivity with false alarm
rates, computational cost with monitoring coverage, and automation with human oversight.

**Integration Challenges**: MLOps systems must work across diverse infrastructure,
handle partial failures gracefully, and provide reliable monitoring even when
models themselves are unreliable.

### Connection to Real Systems

The patterns you implemented mirror those used by major tech companies:
- Statistical drift detection (used by Netflix, Uber)
- Threshold-based alerting (used by Airbnb, Spotify)
- Automated retraining policies (used by Google, Facebook)
- Performance trend analysis (used by Amazon, Microsoft)

### Next Steps

- **Module 22**: Would extend to A/B testing and gradual rollouts
- **Module 23**: Could add distributed monitoring and federated learning
- **Production**: These patterns scale to enterprise MLOps platforms

You now understand how to build production ML systems that monitor themselves,
detect problems automatically, and make intelligent decisions about model updates.
This is the foundation of reliable, scalable ML systems engineering.
"""