You know that 3am incident where the model silently started returning null for 8% of requests and nobody noticed for six hours because the dashboard showed 'healthy' the whole time? We've had that incident. We've also had the one where a training job finished successfully but wrote its outputs to the wrong S3 path. And the one where a shadow deployment got promoted to primary because someone clicked the wrong button in a pipeline that had too many buttons. We've learned from all of them — eventually.

We now want to hire someone whose job is to make sure we stop learning these lessons the hard way. The title is ML Reliability Engineer. The work is part MLOps, part SRE, part detective. You'll build better observability into our model serving infrastructure on AWS, improve our canary deployment process so bad models don't survive their first five minutes, and design alerting that catches real problems rather than generating noise.

The stack is Python, Kubernetes, MLflow, and AWS. If you're the kind of engineer who reads incident reports from other companies for fun and comes away with a list of things to implement, we will get along well.
Responsibilities
Build end-to-end observability for model serving: latency, prediction distribution, null rate, and resource metrics
Design and implement automated canary evaluation to catch degraded models before full promotion (see the sketch after this list)
Reduce mean time to detection and mean time to resolution for ML infrastructure incidents
Conduct post-incident reviews and turn findings into implemented improvements, not just action items
Document the reliability architecture and maintain up-to-date runbooks
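To give a concrete flavor of the canary work, here is a minimal sketch of the kind of promotion gate we have in mind: compare a canary's error rate and tail latency against the current primary before allowing promotion. The thresholds, the WindowStats fields, and the function name are illustrative, not our actual policy.

```python
# Illustrative canary promotion gate; thresholds and field names are
# hypothetical, not our production policy.
from dataclasses import dataclass

@dataclass
class WindowStats:
    error_rate: float       # fraction of failed or null responses in the window
    p99_latency_ms: float   # tail latency over the same window

def canary_passes(canary: WindowStats, primary: WindowStats,
                  max_error_delta: float = 0.005,
                  max_latency_ratio: float = 1.2) -> bool:
    """Allow promotion only if the canary is no worse than the primary within tolerance."""
    if canary.error_rate > primary.error_rate + max_error_delta:
        return False  # canary degrades correctness: roll back
    if canary.p99_latency_ms > primary.p99_latency_ms * max_latency_ratio:
        return False  # canary degrades tail latency: roll back
    return True
```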
Requirements
4+ years in platform, site reliability, or MLOps engineering
Kubernetes — you diagnose pod failures, resource exhaustion, and networking issues yourself
MLflow for model registry, versioning, and serving observability
AWS: CloudWatch, EKS, S3, ECR — operational depth, not just familiarity
Python for building monitoring integrations, alerting logic, and reliability tooling (see the sketch after this list)
Experience designing canary deployments and automated rollback for ML workloads
Strong incident analysis instincts — you form hypotheses, test them, and document what you found
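As an example of the Python-plus-CloudWatch work in practice, here is a minimal sketch (not our production code) of a serving-side check that computes a null-prediction rate over a recent sample and publishes it as a custom CloudWatch metric via boto3, so an alarm can fire on it. The namespace, metric name, and dimensions are hypothetical.

```python
# Illustrative only: compute a null-prediction rate and publish it as a
# custom CloudWatch metric. Namespace, metric name, and dimensions are
# hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_null_rate(predictions: list, model_name: str) -> float:
    """Emit the fraction of null predictions so a CloudWatch alarm can fire on it."""
    null_rate = (
        sum(p is None for p in predictions) / len(predictions) if predictions else 0.0
    )
    cloudwatch.put_metric_data(
        Namespace="MLServing",  # hypothetical namespace
        MetricData=[{
            "MetricName": "PredictionNullRate",
            "Dimensions": [{"Name": "ModelName", "Value": model_name}],
            "Value": null_rate,
            "Unit": "None",
        }],
    )
    return null_rate
```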
Benefits
Direct ownership of a reliability engineering function that genuinely needs to get better
Full remote, US time zones
$92,000 – $115,000 base salary + equity
$1,500 annual tooling and conference budget
Post-incident reviews are blame-free and taken seriously — we mean that