Files
TinyTorch/modules/21_mlops/mlops_dev.py
Vijay Janapa Reddi acb7816ad9 STANDARDIZE: Consistent Linear terminology across all modules
Remove backward compatibility aliases and enforce PyTorch-consistent naming:
- Remove Dense = Linear alias in Module 04 (layers)
- Update all Dense references to Linear in Modules 02, 08, 09, 18, 21
- Remove MaxPool2d = MaxPool2D alias in Module 17 (quantization)
- Standardize fc/dense_weights to linear_weights in Module 18 (compression)

Benefits:
- Eliminates naming confusion between Dense/Linear terminology
- Aligns with PyTorch production patterns (nn.Linear)
- Reduces cognitive load with single consistent naming convention
- Improves student transfer to real ML frameworks

All modules tested and functionality preserved.
2025-09-26 11:51:54 -04:00

1008 lines
37 KiB
Python

#| default_exp core.mlops
"""
# MLOps - Production ML Systems Module
This module teaches production ML systems engineering through hands-on implementation
of monitoring, deployment, and lifecycle management tools.
**Learning Focus**: Systems engineering principles for production ML
**Key Concepts**: Model monitoring, drift detection, automated retraining
**Systems Insights**: How real ML systems manage model lifecycles
"""
# %% [markdown]
"""
## 🎯 Learning Objectives
By completing this module, you will:
1. **Build model monitoring systems** that track performance degradation in production
2. **Implement drift detection** to identify when data changes affect model quality
3. **Create automated retraining triggers** that respond to performance issues
4. **Understand MLOps systems engineering** principles used in real production systems
**Systems Engineering Focus**: This module emphasizes building production-ready
infrastructure, not just algorithms. You'll learn memory management, performance
monitoring, and system architecture patterns used in enterprise ML systems.
"""
# %% [markdown]
"""
## 📚 Background: Production ML Systems
### The MLOps Challenge
In production ML systems, models don't just run once - they serve predictions
continuously for months or years. This creates unique engineering challenges:
1. **Model Degradation**: Performance drops as data changes over time
2. **Data Drift**: Input distributions shift, affecting model validity
3. **Infrastructure Complexity**: Monitoring, versioning, and deployment at scale
4. **Incident Response**: Automated detection and response to performance issues
### Real-World Context
Companies like Netflix, Uber, and Airbnb run thousands of ML models in production.
Each model requires:
- Continuous performance monitoring
- Automated drift detection
- Retraining pipelines
- A/B testing infrastructure
- Incident response systems
This module teaches you to build these systems from scratch.
"""
import numpy as np
import os
import sys
import time
import json
from typing import Dict, List, Tuple, Optional, Any, Callable
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from collections import defaultdict
# Import our dependencies - try from package first, then local modules
try:
from tinytorch.core.tensor import Tensor
from tinytorch.core.training import Trainer
from tinytorch.core.layers import Linear
except ImportError:
# For development, fallback gracefully
print("⚠️ Some TinyTorch modules not available - MLOps will use mock implementations")
Tensor = None
Trainer = None
# %% [markdown]
"""
---
## 🔬 SECTION 1: Model Performance Monitoring
### The Foundation: Tracking Model Health
Real production systems continuously monitor model performance. When accuracy drops
or latency increases, automated systems need to detect and respond.
**Systems Insight**: Production monitoring is about **thresholds and trends**, not
perfect accuracy. A 5% accuracy drop might trigger retraining, while gradual
degradation over weeks might indicate data drift.
**What You'll Build**: A `ModelMonitor` class that tracks accuracy over time
and detects performance degradation using simple thresholds.
"""
@dataclass
class ModelMonitor:
"""
Monitors ML model performance in production environments.
Tracks accuracy, latency, and other metrics over time to detect degradation.
This is the foundation of production ML systems monitoring.
"""
def __init__(self, model_name: str, baseline_accuracy: float = 0.95):
"""
Initialize model monitoring with baseline performance expectations.
Args:
model_name: Unique identifier for the model being monitored
baseline_accuracy: Expected accuracy level (alerts when dropping below 90% of this)
"""
self.model_name = model_name
self.baseline_accuracy = baseline_accuracy
# Performance history - stored as lists for simple analysis
self.accuracy_history: List[float] = []
self.latency_history: List[float] = [] # milliseconds
self.timestamp_history: List[datetime] = []
# Alert thresholds - 90% of baseline triggers accuracy alert
self.accuracy_threshold = baseline_accuracy * 0.9
self.latency_threshold = 200.0 # milliseconds
# Alert states
self.accuracy_alert = False
self.latency_alert = False
print(f"✅ Model monitor initialized for '{model_name}'")
print(f" Baseline accuracy: {baseline_accuracy:.3f}")
print(f" Alert threshold: {self.accuracy_threshold:.3f}")
def record_performance(self, accuracy: float, latency: float) -> None:
"""
Record a new performance measurement.
Args:
accuracy: Model accuracy on recent batch
latency: Inference latency in milliseconds
"""
self.accuracy_history.append(accuracy)
self.latency_history.append(latency)
self.timestamp_history.append(datetime.now())
# Update alert states
self.accuracy_alert = accuracy < self.accuracy_threshold
self.latency_alert = latency > self.latency_threshold
# Log significant changes
if self.accuracy_alert:
print(f"🚨 ACCURACY ALERT: {accuracy:.3f} < {self.accuracy_threshold:.3f}")
if self.latency_alert:
print(f"🚨 LATENCY ALERT: {latency:.1f}ms > {self.latency_threshold:.1f}ms")
def check_alerts(self) -> Dict[str, Any]:
"""
Check current alert status and return summary.
Returns:
Dictionary with alert information and recent performance
"""
if not self.accuracy_history:
return {"status": "no_data", "alerts": []}
recent_accuracy = self.accuracy_history[-1]
recent_latency = self.latency_history[-1]
alerts = []
if self.accuracy_alert:
alerts.append({
"type": "accuracy_degradation",
"current": recent_accuracy,
"threshold": self.accuracy_threshold,
"severity": "high"
})
if self.latency_alert:
alerts.append({
"type": "latency_spike",
"current": recent_latency,
"threshold": self.latency_threshold,
"severity": "medium"
})
return {
"model_name": self.model_name,
"status": "alert" if alerts else "healthy",
"alerts": alerts,
"recent_accuracy": recent_accuracy,
"recent_latency": recent_latency,
"total_measurements": len(self.accuracy_history)
}
def get_performance_trend(self, window: int = 5) -> Dict[str, str]:
"""
Analyze performance trends over recent measurements.
Args:
window: Number of recent measurements to analyze
Returns:
Trend analysis for accuracy and latency
"""
if len(self.accuracy_history) < window:
return {"accuracy_trend": "insufficient_data", "latency_trend": "insufficient_data"}
# Compare recent window to previous window
recent_acc = np.mean(self.accuracy_history[-window:])
older_acc = np.mean(self.accuracy_history[-2*window:-window]) if len(self.accuracy_history) >= 2*window else recent_acc
recent_lat = np.mean(self.latency_history[-window:])
older_lat = np.mean(self.latency_history[-2*window:-window]) if len(self.latency_history) >= 2*window else recent_lat
# Simple trend analysis with 2% threshold
accuracy_trend = "stable"
if recent_acc > older_acc * 1.02:
accuracy_trend = "improving"
elif recent_acc < older_acc * 0.98:
accuracy_trend = "degrading"
latency_trend = "stable"
if recent_lat > older_lat * 1.1:
latency_trend = "increasing"
elif recent_lat < older_lat * 0.9:
latency_trend = "decreasing"
return {
"accuracy_trend": accuracy_trend,
"latency_trend": latency_trend,
"recent_accuracy": recent_acc,
"older_accuracy": older_acc,
"recent_latency": recent_lat,
"older_latency": older_lat
}
# %% [markdown]
"""
### 🧪 Testing: Model Monitor
Let's test our model monitoring system with realistic scenarios.
"""
def test_model_monitor():
"""Test the ModelMonitor with realistic performance scenarios."""
print("🧪 Testing Model Monitor...")
# Create monitor for an image classifier
monitor = ModelMonitor("image_classifier_v1", baseline_accuracy=0.92)
# Simulate healthy performance
print("\n📊 Phase 1: Healthy Performance")
for i in range(5):
accuracy = 0.91 + np.random.normal(0, 0.01) # Around 91% with small variance
latency = 150 + np.random.normal(0, 10) # Around 150ms
monitor.record_performance(accuracy, latency)
alerts = monitor.check_alerts()
print(f"Status: {alerts['status']}")
print(f"Alerts: {len(alerts['alerts'])}")
# Simulate performance degradation
print("\n📉 Phase 2: Performance Degradation")
for i in range(3):
accuracy = 0.81 + np.random.normal(0, 0.02) # Drop to 81% - below threshold
latency = 220 + np.random.normal(0, 15) # Spike to 220ms
monitor.record_performance(accuracy, latency)
alerts = monitor.check_alerts()
print(f"Status: {alerts['status']}")
print(f"Number of alerts: {len(alerts['alerts'])}")
for alert in alerts['alerts']:
print(f" - {alert['type']}: {alert['current']:.3f} (threshold: {alert['threshold']:.3f})")
# Analyze trends
trend = monitor.get_performance_trend()
print(f"\n📈 Trend Analysis:")
print(f"Accuracy trend: {trend['accuracy_trend']}")
print(f"Latency trend: {trend['latency_trend']}")
print("✅ Model monitor test completed!")
# %% [markdown]
"""
---
## 🔬 SECTION 2: Data Drift Detection
### The Challenge: When Data Changes
Data drift occurs when the input distribution changes over time. A model trained
on summer data might perform poorly in winter. Production systems need to detect
this automatically.
**Systems Insight**: Drift detection is about **statistical comparisons**, not
model accuracy. We compare new data statistics to baseline statistics to detect
distribution shifts before they affect model performance.
**What You'll Build**: A `DriftDetector` class using simple statistical thresholds
(mean differences) rather than complex statistical tests.
"""
class DriftDetector:
"""
Detects distribution drift in input data using statistical methods.
Compares new data statistics against baseline to identify significant changes
that might affect model performance.
"""
def __init__(self, baseline_data: np.ndarray, feature_names: Optional[List[str]] = None):
"""
Initialize drift detection with baseline data statistics.
Args:
baseline_data: Reference data to compare against (n_samples x n_features)
feature_names: Optional names for features (for reporting)
"""
self.baseline_data = baseline_data
self.feature_names = feature_names or [f"feature_{i}" for i in range(baseline_data.shape[1])]
# Compute baseline statistics
self.baseline_mean = np.mean(baseline_data, axis=0)
self.baseline_std = np.std(baseline_data, axis=0)
self.baseline_min = np.min(baseline_data, axis=0)
self.baseline_max = np.max(baseline_data, axis=0)
# Drift history
self.drift_history: List[Dict] = []
print(f"✅ Drift detector initialized")
print(f" Baseline shape: {baseline_data.shape}")
print(f" Features: {len(self.feature_names)}")
print(f" Baseline mean: [{', '.join([f'{m:.3f}' for m in self.baseline_mean[:3]])}...]")
def detect_simple_drift(self, new_data: np.ndarray, threshold: float = 2.0) -> Dict[str, Any]:
"""
Simple drift detection using statistical thresholds.
Args:
new_data: New data to compare against baseline
threshold: Number of standard deviations to consider drift
Returns:
Drift detection results
"""
if new_data.shape[1] != self.baseline_data.shape[1]:
raise ValueError(f"Feature count mismatch: {new_data.shape[1]} vs {self.baseline_data.shape[1]}")
# Compute new data statistics
new_mean = np.mean(new_data, axis=0)
new_std = np.std(new_data, axis=0)
# Detect drift per feature using simple threshold
drift_flags = []
drift_scores = []
for i in range(len(self.baseline_mean)):
# Mean shift detection
mean_diff = abs(new_mean[i] - self.baseline_mean[i])
mean_threshold = threshold * self.baseline_std[i]
# Normalize drift score to handle different feature scales
# Small epsilon (1e-8) prevents division by zero
drift_score = mean_diff / (self.baseline_std[i] + 1e-8)
drift_flags.append(drift_score > threshold)
drift_scores.append(drift_score)
# Overall drift assessment with clear thresholds
drift_detected = any(drift_flags)
max_drift_score = max(drift_scores)
# Simple severity classification (no magic numbers)
if max_drift_score > 3.0:
drift_severity = "high"
elif max_drift_score > 2.0:
drift_severity = "medium"
else:
drift_severity = "low"
result = {
"timestamp": datetime.now(),
"drift_detected": drift_detected,
"drift_severity": drift_severity,
"max_drift_score": max_drift_score,
"per_feature": {
self.feature_names[i]: {
"drift_detected": drift_flags[i],
"drift_score": drift_scores[i],
"baseline_mean": self.baseline_mean[i],
"new_mean": new_mean[i]
}
for i in range(len(self.feature_names))
},
"summary": {
"total_features": len(self.feature_names),
"drifted_features": sum(drift_flags),
"drift_percentage": sum(drift_flags) / len(drift_flags) * 100
}
}
# Store in history
self.drift_history.append(result)
if drift_detected:
print(f"🚨 DRIFT DETECTED: {sum(drift_flags)}/{len(drift_flags)} features drifted")
print(f" Severity: {drift_severity} (max score: {max_drift_score:.2f})")
return result
def get_drift_history(self, limit: int = 10) -> List[Dict]:
"""Get recent drift detection history."""
return self.drift_history[-limit:] if limit else self.drift_history
# %% [markdown]
"""
### 🧪 Testing: Drift Detection
Let's test drift detection with simulated data distribution changes.
"""
def test_drift_detection():
"""Test drift detection with controlled data changes."""
print("🧪 Testing Drift Detection...")
# Create baseline data - normal distribution
np.random.seed(42)
baseline_data = np.random.normal(loc=[0, 1, -0.5], scale=[1, 0.5, 0.8], size=(1000, 3))
feature_names = ["temperature", "humidity", "pressure"]
detector = DriftDetector(baseline_data, feature_names)
# Test 1: No drift - similar distribution
print("\n📊 Test 1: No Drift Expected")
new_data_normal = np.random.normal(loc=[0.1, 1.1, -0.4], scale=[1, 0.5, 0.8], size=(500, 3))
result1 = detector.detect_simple_drift(new_data_normal)
print(f"Drift detected: {result1['drift_detected']}")
print(f"Drifted features: {result1['summary']['drifted_features']}/{result1['summary']['total_features']}")
# Test 2: Clear drift - shifted distribution
print("\n🚨 Test 2: Clear Drift Expected")
new_data_drift = np.random.normal(loc=[3, 1, -0.5], scale=[1, 0.5, 0.8], size=(500, 3)) # Temperature shifted
result2 = detector.detect_simple_drift(new_data_drift)
print(f"Drift detected: {result2['drift_detected']}")
print(f"Severity: {result2['drift_severity']}")
print(f"Max drift score: {result2['max_drift_score']:.2f}")
# Show per-feature results
for feature, info in result2['per_feature'].items():
if info['drift_detected']:
print(f" - {feature}: {info['baseline_mean']:.2f}{info['new_mean']:.2f} (score: {info['drift_score']:.2f})")
print("✅ Drift detection test completed!")
# %% [markdown]
"""
---
## 🔬 SECTION 3: Automated Retraining Triggers
### The Response: When to Retrain
When performance drops or drift is detected, production systems need to decide
whether to retrain the model. This requires balancing computational cost with
model quality.
**Systems Insight**: Retraining triggers are about **policies and thresholds**.
Real systems consider accuracy drops, drift severity, data volume, and business
impact when deciding to retrain.
**What You'll Build**: A `RetrainingTrigger` class that makes retraining decisions
based on configurable policies combining performance and drift signals.
"""
class RetrainingTrigger:
"""
Manages automated retraining decisions based on performance and drift signals.
Implements policies for when to trigger expensive retraining operations.
"""
def __init__(self, model_name: str, retraining_policy: Optional[Dict] = None):
"""
Initialize retraining trigger with policies.
Args:
model_name: Model being managed
retraining_policy: Configuration for retraining decisions
"""
self.model_name = model_name
# Default retraining policy
default_policy = {
"accuracy_threshold": 0.05, # 5% accuracy drop triggers retraining
"drift_threshold": 2.0, # Drift score > 2.0 triggers retraining
"min_time_between_retrains": 24 * 3600, # 24 hours minimum
"max_time_without_retrain": 7 * 24 * 3600, # 7 days maximum
"drift_severity_weights": {"low": 0.1, "medium": 0.5, "high": 1.0} # Simplified weights
}
self.policy = {**default_policy, **(retraining_policy or {})}
# Retraining history
self.retraining_history: List[Dict] = []
self.last_retrain_time = datetime.now()
print(f"✅ Retraining trigger initialized for '{model_name}'")
print(f" Accuracy threshold: {self.policy['accuracy_threshold']} drop")
print(f" Drift threshold: {self.policy['drift_threshold']}")
def should_retrain(self, monitor_alerts: Dict, drift_result: Dict) -> Dict[str, Any]:
"""
Decide whether to trigger retraining based on current conditions.
Args:
monitor_alerts: Results from ModelMonitor.check_alerts()
drift_result: Results from DriftDetector.detect_simple_drift()
Returns:
Retraining decision with reasoning
"""
current_time = datetime.now()
time_since_last_retrain = (current_time - self.last_retrain_time).total_seconds()
# Initialize decision tracking
trigger_reasons = []
severity_score = 0.0
# Check accuracy degradation
accuracy_trigger = False
if monitor_alerts['status'] == 'alert':
for alert in monitor_alerts['alerts']:
if alert['type'] == 'accuracy_degradation':
accuracy_drop = alert['threshold'] - alert['current']
if accuracy_drop >= self.policy['accuracy_threshold']:
accuracy_trigger = True
trigger_reasons.append(f"Accuracy dropped by {accuracy_drop:.3f}")
severity_score += 1.0
# Check drift conditions
drift_trigger = False
if drift_result['drift_detected']:
drift_weight = self.policy['drift_severity_weights'][drift_result['drift_severity']]
if drift_result['max_drift_score'] >= self.policy['drift_threshold']:
drift_trigger = True
trigger_reasons.append(f"Drift detected (score: {drift_result['max_drift_score']:.2f})")
severity_score += drift_weight
# Check time-based policies
time_trigger = False
if time_since_last_retrain >= self.policy['max_time_without_retrain']:
time_trigger = True
trigger_reasons.append(f"Maximum time exceeded ({time_since_last_retrain/86400:.1f} days)")
severity_score += 0.5
# Check minimum time constraint
min_time_satisfied = time_since_last_retrain >= self.policy['min_time_between_retrains']
# Final decision
should_retrain = (accuracy_trigger or drift_trigger or time_trigger) and min_time_satisfied
if not min_time_satisfied and (accuracy_trigger or drift_trigger):
trigger_reasons.append(f"Waiting for minimum time ({self.policy['min_time_between_retrains']/3600:.1f}h)")
decision = {
"timestamp": current_time,
"should_retrain": should_retrain,
"severity_score": severity_score,
"trigger_reasons": trigger_reasons,
"constraints": {
"accuracy_trigger": accuracy_trigger,
"drift_trigger": drift_trigger,
"time_trigger": time_trigger,
"min_time_satisfied": min_time_satisfied
},
"time_since_last_retrain": time_since_last_retrain,
"next_allowed_retrain": self.last_retrain_time + timedelta(seconds=self.policy['min_time_between_retrains'])
}
if should_retrain:
print(f"🔄 RETRAINING TRIGGERED for {self.model_name}")
print(f" Reasons: {', '.join(trigger_reasons)}")
print(f" Severity score: {severity_score:.2f}")
self.last_retrain_time = current_time
self.retraining_history.append(decision)
return decision
def get_retraining_history(self, limit: int = 10) -> List[Dict]:
"""Get recent retraining decision history."""
return self.retraining_history[-limit:] if limit else self.retraining_history
# %% [markdown]
"""
### 🧪 Testing: Retraining Triggers
Let's test the retraining decision logic with various scenarios.
"""
def test_retraining_triggers():
"""Test retraining trigger logic with different scenarios."""
print("🧪 Testing Retraining Triggers...")
# Create retraining trigger
trigger = RetrainingTrigger("test_model", {
"min_time_between_retrains": 60, # 1 minute for testing
"accuracy_threshold": 0.05
})
# Scenario 1: No issues - no retrain
print("\n📊 Scenario 1: Healthy Model")
healthy_alerts = {"status": "healthy", "alerts": []}
no_drift = {"drift_detected": False, "drift_severity": "low", "max_drift_score": 0.5}
decision1 = trigger.should_retrain(healthy_alerts, no_drift)
print(f"Should retrain: {decision1['should_retrain']}")
print(f"Reasons: {decision1['trigger_reasons']}")
# Scenario 2: Accuracy drop - should trigger
print("\n🚨 Scenario 2: Accuracy Degradation")
accuracy_alerts = {
"status": "alert",
"alerts": [{
"type": "accuracy_degradation",
"current": 0.85,
"threshold": 0.90,
"severity": "high"
}]
}
decision2 = trigger.should_retrain(accuracy_alerts, no_drift)
print(f"Should retrain: {decision2['should_retrain']}")
print(f"Reasons: {decision2['trigger_reasons']}")
print(f"Severity score: {decision2['severity_score']}")
# Wait to satisfy minimum time constraint
time.sleep(1)
# Scenario 3: High drift - should trigger
print("\n🚨 Scenario 3: High Drift")
high_drift = {"drift_detected": True, "drift_severity": "high", "max_drift_score": 3.5}
decision3 = trigger.should_retrain(healthy_alerts, high_drift)
print(f"Should retrain: {decision3['should_retrain']}")
print(f"Reasons: {decision3['trigger_reasons']}")
print("✅ Retraining triggers test completed!")
# %% [markdown]
"""
---
## 🔬 SECTION 4: Systems Analysis **[ADVANCED - OPTIONAL]**
> **Note**: This section contains advanced systems analysis. Focus on Sections 1-3
> for core MLOps concepts. This section shows how to analyze production performance.
### Memory Analysis: Monitoring Infrastructure Costs
Let's analyze the memory usage patterns of our MLOps infrastructure and understand
the computational costs of production monitoring.
**Advanced Topic**: Production MLOps systems need to monitor their own performance
to ensure monitoring doesn't become more expensive than the models being monitored.
"""
def analyze_mlops_memory_usage():
"""Analyze memory consumption patterns in MLOps components."""
print("🔬 MLOps Memory Analysis")
print("=" * 50)
import tracemalloc
tracemalloc.start()
# Test different monitoring scales
model_counts = [1, 10, 100]
history_lengths = [100, 1000, 10000]
for model_count in model_counts:
for history_length in history_lengths:
tracemalloc.clear_traces()
# Create monitoring infrastructure
monitors = []
for i in range(model_count):
monitor = ModelMonitor(f"model_{i}", baseline_accuracy=0.9)
# Fill with history
for j in range(history_length):
monitor.record_performance(
accuracy=0.85 + np.random.normal(0, 0.02),
latency=150 + np.random.normal(0, 10)
)
monitors.append(monitor)
# Measure memory
current, peak = tracemalloc.get_traced_memory()
# Calculate per-model overhead
per_model_kb = (current / 1024) / model_count
per_measurement_bytes = current / (model_count * history_length)
print(f"Models: {model_count:3d}, History: {history_length:5d}")
print(f" Total memory: {current/1024/1024:.2f} MB")
print(f" Per model: {per_model_kb:.1f} KB")
print(f" Per measurement: {per_measurement_bytes:.1f} bytes")
print()
tracemalloc.stop()
print("💡 Memory Insights:")
print("- Each monitor uses ~15-20KB for 1000 measurements")
print("- Memory scales linearly with history length")
print("- Real systems use circular buffers to limit memory growth")
print("- Database storage is preferred for long-term history")
def analyze_mlops_computational_complexity():
"""Analyze computational complexity of MLOps operations."""
print("🔬 MLOps Computational Complexity")
print("=" * 50)
# Test drift detection complexity
feature_counts = [10, 100, 1000]
sample_counts = [100, 1000, 10000]
print("Drift Detection Performance:")
for n_features in feature_counts:
for n_samples in sample_counts:
# Create test data
baseline = np.random.normal(0, 1, (n_samples, n_features))
new_data = np.random.normal(0.1, 1, (n_samples//2, n_features))
detector = DriftDetector(baseline)
# Time the operation
start_time = time.time()
result = detector.detect_simple_drift(new_data)
end_time = time.time()
computation_time = (end_time - start_time) * 1000 # milliseconds
print(f" Features: {n_features:4d}, Samples: {n_samples:5d}")
print(f" Time: {computation_time:.2f}ms")
print(f" O(N): {computation_time/(n_features*n_samples)*1e6:.2f} ns per feature-sample")
print("\n💡 Complexity Insights:")
print("- Drift detection is O(N*M) where N=samples, M=features")
print("- Real systems use sampling and approximation for large datasets")
print("- Statistical tests (KS, Mann-Whitney) are more expensive but robust")
print("- Production systems often batch process drift detection")
# %% [markdown]
"""
## 🧪 Comprehensive Testing
Let's test the complete MLOps system integration.
"""
def run_comprehensive_mlops_tests():
"""Run all MLOps component tests."""
print("🧪 Running Comprehensive MLOps Tests")
print("=" * 50)
try:
# Test each component
test_model_monitor()
print()
test_drift_detection()
print()
test_retraining_triggers()
print()
# Test integration
print("🔄 Testing Integration...")
# Create integrated system
monitor = ModelMonitor("production_model", baseline_accuracy=0.92)
# Generate baseline data for drift detection
np.random.seed(42)
baseline_data = np.random.normal(loc=[0, 1], scale=[1, 0.5], size=(1000, 2))
detector = DriftDetector(baseline_data, ["feature_a", "feature_b"])
trigger = RetrainingTrigger("production_model")
# Simulate production scenario
print("\n📊 Production Simulation:")
# Day 1-3: Normal operation
print("Days 1-3: Normal operation")
for day in range(3):
accuracy = 0.91 + np.random.normal(0, 0.01)
latency = 150 + np.random.normal(0, 10)
monitor.record_performance(accuracy, latency)
new_data = np.random.normal(loc=[0.1, 1.1], scale=[1, 0.5], size=(100, 2))
drift_result = detector.detect_simple_drift(new_data)
alerts = monitor.check_alerts()
decision = trigger.should_retrain(alerts, drift_result)
print(f" Day {day+1}: Accuracy={accuracy:.3f}, Drift={drift_result['drift_detected']}, Retrain={decision['should_retrain']}")
# Day 4: Data shift occurs
print("\nDay 4: Data distribution shift")
accuracy = 0.87 # Significant drop
latency = 180
monitor.record_performance(accuracy, latency)
# Shifted data
new_data = np.random.normal(loc=[2, 1], scale=[1, 0.5], size=(100, 2)) # Feature A shifted
drift_result = detector.detect_simple_drift(new_data)
alerts = monitor.check_alerts()
decision = trigger.should_retrain(alerts, drift_result)
print(f" Day 4: Accuracy={accuracy:.3f}, Drift={drift_result['drift_detected']}, Retrain={decision['should_retrain']}")
if decision['should_retrain']:
print(f" Trigger reasons: {', '.join(decision['trigger_reasons'])}")
print("\n✅ Comprehensive MLOps integration test completed!")
except Exception as e:
print(f"❌ Test failed: {e}")
raise
# %% [markdown]
"""
---
## 📊 Production Context: Real-World MLOps Systems **[OPTIONAL]**
> **Note**: This section provides real-world context but isn't required for
> understanding the core implementations above.
### How Real Systems Handle MLOps
**Netflix**: Runs 1000+ ML models in production with automated monitoring
- Uses Kafka for real-time metric streaming
- A/B tests model versions automatically
- Monitors business metrics (click-through rate) alongside model metrics
**Uber**: MLOps platform serves billions of predictions daily
- Custom monitoring with drift detection for rider demand models
- Automated retraining triggers based on city-specific performance
- Feature stores with automated data quality checks
**Airbnb**: ML models for pricing, search ranking, and fraud detection
- Custom dashboard showing model health across all services
- Automated canary deployments with rollback capabilities
- Real-time alert system integrated with incident response
### Systems Engineering Insights
1. **Scale**: Real systems monitor thousands of models simultaneously
2. **Automation**: Human intervention is expensive - everything must be automated
3. **Business Impact**: Technical metrics (accuracy) must connect to business metrics (revenue)
4. **Reliability**: MLOps infrastructure must be more reliable than the models it monitors
5. **Cost**: Monitoring costs must be balanced against model improvement benefits
"""
# %% [markdown]
"""
## 🧪 Main Execution
Run all tests when this module is executed directly.
"""
if __name__ == "__main__":
print("🚀 TinyTorch MLOps Module")
print("=" * 50)
try:
# Run all tests
run_comprehensive_mlops_tests()
print()
# Run performance analysis
analyze_mlops_memory_usage()
print()
analyze_mlops_computational_complexity()
print("🎉 All MLOps tests completed successfully!")
except Exception as e:
print(f"❌ MLOps module execution failed: {e}")
import traceback
traceback.print_exc()
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions
These questions help you think about the systems engineering principles behind
MLOps infrastructure and connect to real production challenges.
"""
# %% [markdown]
"""
**Question 1: Memory Management in Production Monitoring**
You're monitoring 500 models in production, each recording performance metrics
every minute. Each model maintains 1 week of history.
Calculate:
- Total memory usage for metric storage
- Memory growth rate per day
- When you need to implement data retention policies
*Consider: How would you design a memory-efficient monitoring system that
doesn't lose critical historical information?*
"""
# %% [markdown]
"""
**Question 2: Drift Detection Trade-offs**
Your drift detector takes 50ms to analyze 1000 features. You need to monitor
100 models, each with 5000 features, every hour.
Analyze:
- Total computational cost per monitoring cycle
- Whether you can meet the hourly requirement
- How to optimize without losing detection quality
*Consider: What approximations might real systems use to handle larger scale?*
"""
# %% [markdown]
"""
**Question 3: Retraining Decision Economics**
Model retraining costs $100 in compute and takes 2 hours. A 5% accuracy drop
costs $50/hour in lost business value.
Design a retraining policy that:
- Minimizes total cost (compute + business impact)
- Accounts for retraining reliability (90% success rate)
- Handles different model criticalities
*Consider: How do real companies balance automation vs human oversight for
expensive retraining decisions?*
"""
# %% [markdown]
"""
**Question 4: Systems Integration Challenges**
Your MLOps system needs to integrate with:
- 5 different model serving systems
- 3 data pipelines with different schemas
- Legacy monitoring infrastructure
- Multiple ML frameworks (PyTorch, TensorFlow, XGBoost)
Evaluate:
- How to design interfaces that work across all systems
- Where to place monitoring hooks in the pipeline
- How to handle partial failures and degraded monitoring
*Consider: What abstraction layers help manage this complexity while keeping
the monitoring reliable?*
"""
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: MLOps - Production ML Systems
### What You Built
In this module, you implemented a complete MLOps monitoring system from scratch:
1. **ModelMonitor**: Tracks model performance and detects degradation
2. **DriftDetector**: Identifies data distribution changes using statistical methods
3. **RetrainingTrigger**: Makes automated decisions about when to retrain models
4. **Integrated System**: All components working together for production monitoring
### Key Systems Engineering Insights
**Memory Management**: MLOps systems accumulate large amounts of monitoring data.
You learned how to analyze memory usage patterns and design for scale.
**Computational Complexity**: Drift detection and monitoring can be expensive at scale.
You analyzed algorithmic complexity and identified optimization opportunities.
**Production Trade-offs**: Real systems balance detection sensitivity with false alarm
rates, computational cost with monitoring coverage, and automation with human oversight.
**Integration Challenges**: MLOps systems must work across diverse infrastructure,
handle partial failures gracefully, and provide reliable monitoring even when
models themselves are unreliable.
### Connection to Real Systems
The patterns you implemented mirror those used by major tech companies:
- Statistical drift detection (used by Netflix, Uber)
- Threshold-based alerting (used by Airbnb, Spotify)
- Automated retraining policies (used by Google, Facebook)
- Performance trend analysis (used by Amazon, Microsoft)
### Next Steps
- **Module 22**: Would extend to A/B testing and gradual rollouts
- **Module 23**: Could add distributed monitoring and federated learning
- **Production**: These patterns scale to enterprise MLOps platforms
You now understand how to build production ML systems that monitor themselves,
detect problems automatically, and make intelligent decisions about model updates.
This is the foundation of reliable, scalable ML systems engineering.
"""