MLOps Production Architecture on AWS

Scalable, Secure & Production-Ready Deployment

Data Ingestion Layer

Facebook Ads API

External Data Source

  • Campaign metrics
  • Daily aggregation
  • Real-time updates

AWS Lambda

Scheduled ETL Function

  • Amazon EventBridge (formerly CloudWatch Events) trigger
  • Daily at 00:00 UTC
  • Error handling & retry
  • Dead letter queue
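
The retry behaviour listed above can be illustrated with a minimal sketch. It assumes the function is invoked on a schedule and that a failed invocation should surface as an exception so the platform's own retry and dead letter queue handling can take over. The names `with_retries`, `fetch_campaign_metrics`, and `handler` are hypothetical, and the actual Facebook Ads API call is left as a placeholder.

```python
import time
from typing import Callable


def with_retries(fetch: Callable[[], dict], attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                # Out of retries: re-raise so the invocation is marked failed
                # and can be routed to the dead letter queue.
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be >= 1")


def fetch_campaign_metrics() -> dict:
    """Hypothetical placeholder for the Facebook Ads API request."""
    raise NotImplementedError


def handler(event, context):
    """Lambda entry point invoked by the daily schedule."""
    metrics = with_retries(fetch_campaign_metrics)
    return {"status": "ok", "records": len(metrics.get("data", []))}
```

Passing `sleep` as a parameter keeps the backoff testable without real delays.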

Amazon S3

Raw Data Lake

  • Bucket: mlops-raw-data
  • Versioning enabled
  • Lifecycle policies
  • Encrypted at rest
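
One way to lay out the raw data lake is a date-partitioned key per daily run, paired with a lifecycle rule. The sketch below assumes a `facebook_ads/` prefix, a JSON file per day, and 30/90-day lifecycle thresholds, none of which come from the diagram. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, but no AWS call is made here.

```python
from datetime import date

RAW_BUCKET = "mlops-raw-data"  # bucket name from the diagram


def raw_object_key(run_date: date, source: str = "facebook_ads") -> str:
    """Build a date-partitioned object key (hypothetical layout)."""
    return (f"{source}/year={run_date:%Y}/month={run_date:%m}/"
            f"day={run_date:%d}/campaign_metrics.json")


# Assumed thresholds: move objects to infrequent access after 30 days and
# expire noncurrent versions after 90 days.
LIFECYCLE_RULES = {
    "Rules": [
        {
            "ID": "archive-raw-data",
            "Filter": {"Prefix": "facebook_ads/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
        }
    ]
}
```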

Data Processing & Feature Engineering

Apache Airflow on ECS

Workflow Orchestration

  • DAG: daily_ml_pipeline
  • Runs on Amazon ECS (Fargate)
  • Auto-scaling workers
  • Metadata DB on RDS

Amazon ECS Tasks

Processing Jobs

  • Feature engineering
  • Churn detection
  • Quality tier calculation
  • Fargate serverless compute
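
The ordering that a DAG such as daily_ml_pipeline might encode can be sketched with the standard library alone. The task names `load_raw_data` and `write_to_postgres`, and the dependency edges themselves, are assumptions for illustration. A real deployment would express the same graph with Airflow operators.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
DAILY_ML_PIPELINE = {
    "load_raw_data": set(),
    "feature_engineering": {"load_raw_data"},
    "churn_detection": {"feature_engineering"},
    "quality_tier_calculation": {"feature_engineering"},
    "write_to_postgres": {"churn_detection", "quality_tier_calculation"},
}


def execution_order(dag: dict) -> list:
    """Return one valid run order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())
```

Churn detection and quality tier calculation depend only on feature engineering, so an orchestrator is free to run them in parallel.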

Amazon RDS PostgreSQL

Production Database

  • Multi-AZ deployment
  • Automated backups
  • Read replicas for analytics
  • Encrypted connections

Model Training & MLOps

Amazon SageMaker

Model Training & Experimentation

  • Training jobs on ml.m5.xlarge
  • Hyperparameter tuning
  • Model versioning
  • Experiment tracking
  • Automatic model deployment

S3 Model Registry

mlops-models/

  • Versioned model artifacts
  • Metadata & metrics
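
A versioned artifact layout with a metadata document stored beside each model could look like the sketch below. It reads `mlops-models` as a bucket name, which is an assumption. The zero-padded version folder and the `model.tar.gz` file name are also illustrative choices.

```python
import json

MODEL_BUCKET = "mlops-models"  # assumed to be a bucket name


def artifact_uri(model_name: str, version: int) -> str:
    """Return the S3 URI for one versioned model artifact."""
    return f"s3://{MODEL_BUCKET}/{model_name}/v{version:04d}/model.tar.gz"


def metadata_document(model_name: str, version: int, metrics: dict) -> str:
    """Serialize the metadata stored next to the artifact."""
    return json.dumps(
        {"model": model_name, "version": version, "metrics": metrics},
        sort_keys=True,
    )
```

Zero-padding the version keeps lexicographic listing order aligned with numeric order.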

Amazon ECR

Container Registry

  • Docker images
  • Vulnerability scanning

API Serving Layer

Application Load Balancer

Traffic Distribution

  • HTTPS only (SSL/TLS)
  • Health checks
  • Auto-scaling triggers

ECS Service Auto Scaling (Min: 2, Max: 10 tasks)

API Container 1

Flask + ML Models

  • Port 5000
  • Health: /health
  • CPU: 2 vCPU
  • RAM: 4 GB

API Container 2

Flask + ML Models

  • Port 5000
  • Health: /health
  • CPU: 2 vCPU
  • RAM: 4 GB

API Container N

Auto-scaled

  • Dynamic scaling
  • CPU threshold: 70%
  • Scale-up: +1
  • Scale-down: -1
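
The scaling rules above, combined with the service limits of 2 and 10 tasks, amount to a small decision function. The sketch below is a simplified model of that logic, not the AWS scaling engine. The 30% scale-down threshold is an assumption, since the diagram states only the 70% scale-up threshold.

```python
MIN_TASKS, MAX_TASKS = 2, 10   # service limits from the diagram
SCALE_UP_CPU = 70.0            # CPU threshold from the diagram
SCALE_DOWN_CPU = 30.0          # assumption: not stated in the source


def desired_task_count(current: int, avg_cpu_percent: float) -> int:
    """Apply the +1 / -1 step rules and clamp to the service limits."""
    if avg_cpu_percent > SCALE_UP_CPU:
        target = current + 1
    elif avg_cpu_percent < SCALE_DOWN_CPU:
        target = current - 1
    else:
        target = current
    return max(MIN_TASKS, min(MAX_TASKS, target))
```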

Amazon ElastiCache (Redis)

Prediction Cache

  • TTL: 1 hour
  • Cache warming
  • Session storage
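
The prediction cache with a one-hour TTL can be sketched as follows. `TTLCache` is an in-memory stand-in for Redis so the example runs without a server. The key scheme (a SHA-256 hash of the sorted feature JSON) and the helper names are assumptions.

```python
import hashlib
import json
import time

TTL_SECONDS = 3600  # 1 hour, from the diagram


def cache_key(features: dict) -> str:
    """Derive a stable key that ignores the order of feature names."""
    payload = json.dumps(features, sort_keys=True).encode("utf-8")
    return "prediction:" + hashlib.sha256(payload).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl: float = TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._store = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (value, self._clock() + self._ttl)


def predict_cached(cache: TTLCache, features: dict, predict):
    """Return a cached prediction, calling the model only on a miss."""
    key = cache_key(features)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = predict(features)
    cache.set(key, result)
    return result
```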

Monitoring & Observability

Amazon CloudWatch

Metrics & Logs

  • Application logs
  • Custom metrics
  • Dashboards
  • Alarms & notifications
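
One way to publish custom metrics is to write log lines in the CloudWatch Embedded Metric Format. The sketch below only builds such a line. The namespace, dimension, and metric names are assumptions, and how the line reaches CloudWatch Logs (agent or log driver) is outside its scope.

```python
import json
import time


def emf_log_line(namespace: str, service: str, metric: str, value: float,
                 unit: str = "Milliseconds", timestamp_ms=None) -> str:
    """Build one Embedded Metric Format record as a JSON string."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return json.dumps({
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        "Service": service,   # dimension value
        metric: value,        # metric value
    })
```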

AWS X-Ray

Distributed Tracing

  • Request tracing
  • Performance bottlenecks
  • Service map

Amazon SNS

Alerting

  • Email notifications
  • Slack integration
  • PagerDuty escalation

Security & Networking

Amazon VPC

  • Public subnets: ALB
  • Private subnets: ECS, RDS
  • Multi-AZ deployment
  • NAT Gateway
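
The public/private, multi-AZ layout above can be planned with the standard `ipaddress` module. The VPC range `10.0.0.0/16`, the /24 subnet size, and the two availability zone names are assumptions chosen for illustration.

```python
import ipaddress


def plan_subnets(vpc_cidr: str = "10.0.0.0/16",
                 azs=("us-east-1a", "us-east-1b")) -> dict:
    """Assign one public and one private /24 subnet per availability zone."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=24)
    plan = {"public": {}, "private": {}}
    for tier in ("public", "private"):
        for az in azs:
            plan[tier][az] = str(next(blocks))
    return plan
```

In this plan the ALB would sit in the public subnets, and the ECS tasks and RDS instance in the private ones.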

AWS IAM

  • Role-based access
  • Least privilege
  • Service roles for ECS
  • MFA enabled
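
Least privilege for an ECS task role might look like the policy document sketched below: read-only access to model artifacts and nothing else. Treating `mlops-models` as the bucket name and the statement IDs are assumptions. The document structure follows the standard IAM policy JSON format.

```python
import json


def model_read_policy(bucket: str = "mlops-models") -> dict:
    """Build a read-only policy scoped to one model bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadModelArtifacts",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Sid": "ListModelBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(model_read_policy(), indent=2))
```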

AWS Secrets Manager

  • API key rotation
  • Database credentials
  • Encrypted storage

AWS WAF

  • Application-layer (Layer 7) DDoS mitigation
  • Rate limiting
  • IP filtering

Estimated Monthly Cost (US East)

Service                     Monthly cost     Details
ECS Fargate (API)           $150 - $300      2-10 tasks, 0.5 vCPU, 1 GB RAM
RDS PostgreSQL              $80 - $150       db.t3.medium, Multi-AZ
S3 Storage                  $10 - $30        Models + data, ~200 GB
Application Load Balancer   $25 - $40        Running 24/7
Lambda Functions            $5 - $15         ETL jobs, 1M invocations
ElastiCache Redis           $40 - $80        cache.t3.medium
CloudWatch & Logs           $15 - $30        Logs, metrics, alarms
SageMaker Training          $20 - $50        Weekly retraining
Total Estimated             $345 - $695/mo   Scales with usage
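
The total can be checked by summing the low and high ends of each line item. The figures below are copied from the estimate above. Only the helper name `total_range` is new.

```python
# (low, high) monthly estimates in USD, copied from the cost breakdown.
MONTHLY_COSTS = {
    "ECS Fargate (API)": (150, 300),
    "RDS PostgreSQL": (80, 150),
    "S3 Storage": (10, 30),
    "Application Load Balancer": (25, 40),
    "Lambda Functions": (5, 15),
    "ElastiCache Redis": (40, 80),
    "CloudWatch & Logs": (15, 30),
    "SageMaker Training": (20, 50),
}


def total_range(costs: dict) -> tuple:
    """Sum the low and high bounds across all services."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


print(total_range(MONTHLY_COSTS))  # (345, 695)
```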

Deployment Checklist

Infrastructure Setup

  • Create VPC with public/private subnets
  • Configure Security Groups
  • Setup NAT Gateway
  • Create RDS PostgreSQL instance
  • Create S3 buckets (data, models)
  • Setup ElastiCache Redis cluster

Application Deployment

  • Build Docker images
  • Push to ECR
  • Create ECS task definitions
  • Deploy ECS services
  • Configure ALB target groups
  • Setup auto-scaling policies

Monitoring & Security

  • Configure CloudWatch dashboards
  • Setup CloudWatch alarms
  • Enable X-Ray tracing
  • Configure SNS topics
  • Store secrets in Secrets Manager
  • Enable WAF rules