Demonstrating Production-Grade SRE Practices: Observability, Failure Handling, and Automated Recovery
AutoSRE is a production-like reliability engineering platform built to demonstrate how Site Reliability Engineers design, monitor, break, and heal distributed systems. It is more than code: it is a showcase of operational maturity, observability, and incident response automation.
FastAPI service with Prometheus metrics (RED: Rate, Errors, Duration)
Counter, Histogram, and Gauge metrics with proper labeling
Docker-based deployment for consistency across environments
Liveness and readiness probes for container orchestration
Auto-scraping metrics, live dashboards, SLO tracking
Worker service + Auth service with circuit breakers
Intentional failure injection and automated recovery validation
Postmortems with timelines, root cause analysis, and prevention strategies
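To make the circuit breaker and failure injection items above concrete, here is a minimal sketch in plain Python. It is not AutoSRE's actual implementation: the names `CircuitBreaker`, `CircuitOpenError`, `FlakyAuth`, and `run_chaos_drill`, along with the threshold and timeout values, are hypothetical and chosen for illustration. The clock is injectable, so the drill runs instantly without sleeping.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"       # timeout elapsed: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of piling load onto a sick dependency.
            raise CircuitOpenError("dependency unavailable; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip at the threshold, or re-open at once if a half-open trial fails.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


class FlakyAuth:
    """Stand-in for the auth service, with a switch for injecting failure."""

    def __init__(self):
        self.healthy = True

    def verify(self, token):
        if not self.healthy:
            raise ConnectionError("auth service down (injected)")
        return token == "valid-token"


def run_chaos_drill():
    """Inject an outage, then validate that the breaker trips and recovers."""
    now = [0.0]                                  # fake clock, advanced manually
    auth = FlakyAuth()
    breaker = CircuitBreaker(3, 5.0, clock=lambda: now[0])
    timeline = [breaker.state]

    auth.healthy = False                         # inject the failure
    for _ in range(3):
        try:
            breaker.call(auth.verify, "valid-token")
        except ConnectionError:
            pass
    timeline.append(breaker.state)

    try:
        breaker.call(auth.verify, "valid-token")
    except CircuitOpenError:
        timeline.append("rejected")              # auth was never contacted

    auth.healthy = True                          # the dependency heals
    now[0] += 5.0                                # reset timeout elapses
    timeline.append(breaker.state)
    breaker.call(auth.verify, "valid-token")     # trial call succeeds
    timeline.append(breaker.state)
    return timeline
```

Calling `run_chaos_drill()` returns the state timeline of the drill, which is the kind of evidence a recovery validation step or a postmortem timeline could record. A real deployment would more likely lean on an established resilience library and export the breaker state as a metric; the sketch only shows the state machine.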
This project is being built incrementally with full documentation of architectural decisions and SRE thinking.