Cloud Deployment for ML Models

🎯 Learning Objectives

  • Deploy models to cloud platforms

  • Use managed ML services

  • Implement auto-scaling

  • Optimize costs

  • Set up cloud monitoring

Cloud Platform Options

AWS

  • SageMaker: End-to-end ML platform

  • Lambda: Serverless inference

  • ECS/EKS: Container orchestration

  • S3: Model storage

Azure

  • Azure ML: Managed ML platform

  • Azure Functions: Serverless

  • AKS: Kubernetes service

  • Blob Storage: Model storage

GCP

  • Vertex AI: Unified ML platform

  • Cloud Run: Serverless containers

  • GKE: Kubernetes Engine

  • Cloud Storage: Model storage

Deployment Options Comparison

Option       Best For                         Pros                                           Cons
Serverless   Low traffic, variable load       No server management, auto-scale, pay per use  Cold starts, limited resources
Containers   Medium traffic, consistent load  Flexible, portable, good control               Requires orchestration
Kubernetes   High traffic, complex apps       Powerful, scalable, resilient                  Complex setup
Managed ML   Quick deployment                 Easy setup, managed infrastructure             Less control, vendor lock-in

AWS Lambda Deployment

AWS Lambda provides serverless inference where you pay only for the compute time consumed, with no servers to manage. The handler function below loads a scikit-learn model from S3 into a global variable (cached across warm invocations to avoid repeated downloads), parses the input event, and returns predictions in the API Gateway response format. Lambda’s main trade-off is cold starts: the first invocation after idle time takes several seconds as the container initializes and loads the model. For lightweight models under 250MB, Lambda is cost-effective at low-to-medium traffic volumes (under 100K requests per day). For larger models or latency-sensitive applications, container-based deployments are more appropriate.

# Example: AWS Lambda handler for ML model
import json
import boto3
import numpy as np
import joblib
from io import BytesIO

# Initialize S3 client
s3 = boto3.client('s3')

# Load model from S3 (cached globally)
MODEL = None
MODEL_BUCKET = 'my-ml-models'
MODEL_KEY = 'iris-classifier/model.pkl'

def load_model():
    """Load model from S3"""
    global MODEL
    if MODEL is None:
        print("Loading model from S3...")
        obj = s3.get_object(Bucket=MODEL_BUCKET, Key=MODEL_KEY)
        MODEL = joblib.load(BytesIO(obj['Body'].read()))
        print("Model loaded")
    return MODEL

def lambda_handler(event, context):
    """AWS Lambda handler function"""
    try:
        # Parse input
        body = json.loads(event['body'])
        features = np.array(body['features']).reshape(1, -1)
        
        # Load model
        model = load_model()
        
        # Predict
        prediction = model.predict(features)[0]
        probabilities = model.predict_proba(features)[0]
        
        # Return response
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*'
            },
            'body': json.dumps({
                'prediction': int(prediction),
                'probabilities': probabilities.tolist(),
                'model_version': '1.0'
            })
        }
    
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }

print("Lambda handler defined")
print("\nDeploy with:")
print("  1. Package: zip -r function.zip .")
print("  2. Upload: aws lambda create-function ...")
print("  3. Create API Gateway endpoint")
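Before packaging, the handler contract can be exercised locally by passing a synthetic API Gateway proxy event. The sketch below mirrors the handler above but swaps the S3 model load for a stand-in predictor; `FakeModel` and its outputs are illustrative, not part of any AWS API:

```python
import json
import numpy as np

class FakeModel:
    """Stand-in for the S3-hosted classifier (illustrative only)."""
    def predict(self, X):
        return np.array([1])

    def predict_proba(self, X):
        return np.array([[0.1, 0.9]])

def lambda_handler(event, context, model=FakeModel()):
    """Same contract as the handler above, with the S3 load replaced by `model`."""
    try:
        body = json.loads(event['body'])
        features = np.array(body['features']).reshape(1, -1)
        prediction = model.predict(features)[0]
        probabilities = model.predict_proba(features)[0]
        return {
            'statusCode': 200,
            'body': json.dumps({
                'prediction': int(prediction),
                'probabilities': probabilities.tolist(),
            }),
        }
    except Exception as e:
        return {'statusCode': 500, 'body': json.dumps({'error': str(e)})}

# Simulate an API Gateway proxy event locally
event = {'body': json.dumps({'features': [5.1, 3.5, 1.4, 0.2]})}
response = lambda_handler(event, context=None)
print(response['statusCode'])  # 200
```

Testing the handler this way catches malformed event parsing and response serialization bugs before a slow package-upload-invoke cycle.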

Kubernetes Deployment

Kubernetes (K8s) is the production standard for running containerized ML services at scale. The manifests below define three resources: a Deployment with 3 replicas for high availability, a Service that exposes the API behind a load balancer, and a HorizontalPodAutoscaler (HPA) that automatically adds or removes replicas based on CPU and memory utilization. The livenessProbe restarts containers that stop responding, while the readinessProbe removes unhealthy pods from the load balancer rotation. Resource requests and limits prevent a single pod from consuming all cluster resources, which is important for ML workloads that can spike in memory usage during batch predictions.

k8s_deployment = '''
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-api
  labels:
    app: ml-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-api
  template:
    metadata:
      labels:
        app: ml-api
    spec:
      containers:
      - name: ml-api
        image: myregistry/ml-api:v1.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        env:
        - name: MODEL_PATH
          value: "/models/model.pkl"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: ml-api-service
spec:
  selector:
    app: ml-api
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
'''

print("Kubernetes Deployment Configuration:")
print(k8s_deployment)
print("\nDeploy with:")
print("  kubectl apply -f deployment.yaml")
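The liveness and readiness probes above assume the container serves a /health route on port 8000. A minimal sketch using only the standard library is shown below (a real service would typically expose this through FastAPI or Flask; the `model_loaded` field is an illustrative readiness signal, not a required name):

```python
import json
from wsgiref.simple_server import make_server

def app(environ, start_response):
    """Tiny WSGI app exposing the /health route the k8s probes expect."""
    if environ['PATH_INFO'] == '/health':
        # Report overall status plus an illustrative readiness signal
        body = json.dumps({'status': 'ok', 'model_loaded': True}).encode()
        start_response('200 OK', [('Content-Type', 'application/json')])
        return [body]
    start_response('404 Not Found', [('Content-Type', 'application/json')])
    return [json.dumps({'error': 'not found'}).encode()]

# Serve on the containerPort from the manifest:
# make_server('', 8000, app).serve_forever()
```

A readiness probe should check real dependencies (model loaded, downstream services reachable), while the liveness probe stays cheap so a slow model never triggers unnecessary restarts.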

Auto-Scaling Configuration

Advanced auto-scaling goes beyond CPU utilization to incorporate application-level metrics like requests per second and queue depth. The HPA configuration below defines three scaling triggers: CPU utilization above 70%, request rate exceeding 100 per second per pod, and an SQS queue depth above 30 messages. The behavior section controls scaling velocity: scale-up is aggressive (50% increase per minute) to handle traffic spikes, while scale-down is conservative (10% decrease per minute with a 5-minute stabilization window) to avoid oscillation. This asymmetric policy prevents the cluster from repeatedly adding and removing pods during fluctuating traffic.

autoscaling_config = '''
# Auto-scaling based on custom metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-api-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # CPU-based scaling
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  
  # Request rate-based scaling
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  
  # Queue depth scaling
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue_name: ml_predictions
      target:
        type: Value
        value: "30"
  
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
'''

print("Advanced Auto-Scaling Configuration:")
print(autoscaling_config)

Cost Optimization Strategies

Cloud ML infrastructure costs can grow quickly if not managed carefully. The cost comparison below shows how different deployment options suit different traffic patterns: Lambda for sporadic, low-volume traffic (pay-per-request); Fargate for medium, variable workloads (no server management overhead); EC2 for high, consistent traffic (lower per-request cost); and Spot Instances for fault-tolerant batch predictions (60-90% savings). The key principle is to match your deployment option to your traffic pattern and latency requirements, then continuously right-size by monitoring actual resource utilization against provisioned capacity.

import pandas as pd

# Example cost analysis
cost_analysis = pd.DataFrame([
    {
        'Option': 'AWS Lambda',
        'Fixed Cost/mo': '$0',
        'Variable Cost': '$0.20 per 1M requests + $0.0000166667 per GB-second',
        'Best For': 'Low/variable traffic (<100K req/day)',
        'Break-even': '~50K requests/day'
    },
    {
        'Option': 'ECS Fargate',
        'Fixed Cost/mo': '~$30-50',
        'Variable Cost': 'Scales with CPU/memory',
        'Best For': 'Medium traffic (100K-1M req/day)',
        'Break-even': '~100K requests/day'
    },
    {
        'Option': 'EC2 + Docker',
        'Fixed Cost/mo': '~$50-200',
        'Variable Cost': 'Minimal',
        'Best For': 'High consistent traffic (>1M req/day)',
        'Break-even': '~500K requests/day'
    },
    {
        'Option': 'Spot Instances',
        'Fixed Cost/mo': '$0',
        'Variable Cost': '60-90% less than EC2',
        'Best For': 'Batch predictions, non-critical',
        'Break-even': 'Always cheaper (if workload fits)'
    }
])

print("Cost Comparison:")
print(cost_analysis.to_string(index=False))

print("\n=== Cost Optimization Tips ===")
tips = [
    "1. Use spot instances for batch processing",
    "2. Right-size instances (monitor actual usage)",
    "3. Enable auto-scaling to avoid over-provisioning",
    "4. Use reserved instances for predictable workloads",
    "5. Implement caching to reduce compute",
    "6. Compress models to reduce memory needs",
    "7. Use cheaper regions when possible",
    "8. Clean up unused resources regularly"
]

for tip in tips:
    print(tip)
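The break-even volumes in the table can be approximated with simple arithmetic: Lambda's cost scales linearly with requests, so the crossover is where that line meets a server's fixed monthly cost. The sketch below uses the per-request pricing from the table; the duration and memory defaults are illustrative assumptions, and real prices vary by region:

```python
def lambda_monthly_cost(requests_per_day, avg_duration_s=1.0, memory_gb=1.0,
                        price_per_million=0.20, price_per_gb_s=0.0000166667):
    """Approximate monthly Lambda cost: request charge plus GB-second charge."""
    monthly_requests = requests_per_day * 30
    request_cost = monthly_requests / 1_000_000 * price_per_million
    compute_cost = monthly_requests * avg_duration_s * memory_gb * price_per_gb_s
    return request_cost + compute_cost

def break_even(server_cost_per_month, **kwargs):
    """Smallest daily request volume (in 1K steps) where Lambda costs more."""
    for rpd in range(1000, 2_000_000, 1000):
        if lambda_monthly_cost(rpd, **kwargs) > server_cost_per_month:
            return rpd
    return None

# Against a ~$50/mo EC2 instance, with 1s invocations at 1GB memory
print(break_even(50))  # ~99,000 requests/day at these defaults
```

Halving the invocation duration or memory roughly doubles the break-even volume, which is why model compression (tip 6) directly extends the range where serverless stays cheapest.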

Monitoring in the Cloud

Cloud-native monitoring uses provider-specific services like AWS CloudWatch, Azure Monitor, or GCP Cloud Monitoring to collect metrics, set alarms, and build dashboards. The Terraform configuration below creates two CloudWatch alarms: one that fires when average Lambda latency exceeds 1 second, and another when the error count exceeds 10 per minute. Both alarms notify an SNS topic, which can fan out to email, Slack, or PagerDuty. The dashboard widget provides a real-time view of invocation count, errors, and latency, giving on-call engineers immediate visibility into system health without querying logs manually.

cloudwatch_config = '''
# AWS CloudWatch alarms (Terraform example)

resource "aws_cloudwatch_metric_alarm" "high_latency" {
  alarm_name          = "ml-api-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "Duration"
  namespace           = "AWS/Lambda"
  period              = "60"
  statistic           = "Average"
  threshold           = "1000"  # 1 second
  alarm_description   = "Alert when average latency > 1s"
  alarm_actions       = [aws_sns_topic.alerts.arn]
  
  dimensions = {
    FunctionName = aws_lambda_function.ml_api.function_name
  }
}

resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "ml-api-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = "60"
  statistic           = "Sum"
  threshold           = "10"
  alarm_description   = "Alert when error count > 10 per minute"
  alarm_actions       = [aws_sns_topic.alerts.arn]
  
  dimensions = {
    FunctionName = aws_lambda_function.ml_api.function_name
  }
}

resource "aws_cloudwatch_dashboard" "ml_api" {
  dashboard_name = "ml-api-dashboard"
  
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/Lambda", "Invocations", {stat = "Sum"}],
            [".", "Errors", {stat = "Sum"}],
            [".", "Duration", {stat = "Average"}]
          ]
          period = 300
          stat   = "Average"
          region = "us-east-1"
          title  = "ML API Metrics"
        }
      }
    ]
  })
}
'''

print("CloudWatch Monitoring Configuration:")
print(cloudwatch_config)
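Beyond the built-in Lambda metrics, application-level metrics (for example, prediction latency per model version) can be published with `boto3`'s `put_metric_data`. The payload builder below is a sketch; the `MLAPI` namespace and `ModelVersion` dimension are illustrative names, not conventions required by CloudWatch:

```python
import datetime

def build_metric_datum(name, value, unit='Milliseconds', model_version='1.0'):
    """Build one CloudWatch MetricDatum dict for put_metric_data."""
    return {
        'MetricName': name,
        'Value': value,
        'Unit': unit,
        'Timestamp': datetime.datetime.now(datetime.timezone.utc),
        'Dimensions': [
            {'Name': 'ModelVersion', 'Value': model_version},
        ],
    }

datum = build_metric_datum('PredictionLatency', 42.5)

# Publishing requires AWS credentials; shown for illustration only:
# import boto3
# cloudwatch = boto3.client('cloudwatch')
# cloudwatch.put_metric_data(Namespace='MLAPI', MetricData=[datum])
print(datum['MetricName'], datum['Unit'])
```

Dimensions like model version make it possible to compare latency and error rates across deployments on the same dashboard, which is useful during gradual rollouts.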

Deployment Checklist

Pre-Deployment

  • Code reviewed and tested

  • Model performance validated

  • Load testing completed

  • Security scan passed

  • Documentation updated

  • Rollback plan ready

Deployment

  • Deploy to staging first

  • Run smoke tests

  • Monitor key metrics

  • Gradual traffic increase

  • Communicate with stakeholders

Post-Deployment

  • Monitor for 24-48 hours

  • Check error rates

  • Verify latency

  • Review costs

  • Collect user feedback

  • Document lessons learned

Best Practices

  1. Multi-Region Deployment

    • Deploy to multiple regions for resilience

    • Use geo-routing for low latency

    • Implement cross-region failover

  2. Security

    • Use IAM roles (don’t embed credentials)

    • Enable encryption at rest and in transit

    • Implement API authentication

    • Regular security audits

  3. Cost Management

    • Set up billing alerts

    • Use cost allocation tags

    • Right-size resources

    • Review costs monthly

  4. Reliability

    • Implement health checks

    • Set up auto-recovery

    • Use multiple availability zones

    • Regular disaster recovery drills

  5. Observability

    • Centralized logging

    • Distributed tracing

    • Custom metrics

    • Real-time dashboards

Key Takeaways

✅ Choose deployment option based on traffic and budget

✅ Implement auto-scaling for variable load

✅ Monitor costs and optimize regularly

✅ Use managed services when appropriate

✅ Always have a rollback plan

✅ Security and observability are critical