You know that 3am incident where the model silently started returning null for 8% of requests and nobody noticed for six hours because the dashboard showed 'healthy' the whole time? We've had that incident. We've also had the one where a training job finished successfully but wrote its outputs to the wrong S3 path. And the one where a shadow deployment got promoted to primary because someone clicked the wrong button in a pipeline that had too many buttons. We've learned from all of them — eventually.

We now want to hire someone whose job is to make sure we stop learning these lessons the hard way. The title is ML Reliability Engineer. The work is part MLOps, part SRE, part detective. You'll build better observability into our model serving infrastructure on AWS, improve our canary deployment process so bad models don't survive their first five minutes, and design alerting that catches real problems rather than generating noise.

The stack is Python, Kubernetes, MLflow, and AWS. If you're the kind of engineer who reads incident reports from other companies for fun and comes away with a list of things to implement, we will get along well.
Responsibilities
Build end-to-end observability for model serving: latency, prediction distribution, null rate, and resource metrics
Design and implement automated canary evaluation to catch degraded models before full promotion (see the sketch after this list)
Reduce mean time to detection and mean time to resolution for ML infrastructure incidents
Conduct post-incident reviews and turn findings into implemented improvements, not just action items
Document the reliability architecture and maintain up-to-date runbooks
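To give a concrete flavor of the canary work, here is a minimal sketch of the kind of promotion gate we have in mind: compare a canary's error rate and tail latency against the current primary before allowing promotion. The thresholds, the WindowStats fields, and the function name are illustrative, not our actual policy.

```python
# Illustrative canary promotion gate; thresholds and field names are
# hypothetical, not our production policy.
from dataclasses import dataclass

@dataclass
class WindowStats:
    error_rate: float       # fraction of failed or null responses in the window
    p99_latency_ms: float   # tail latency over the same window

def canary_passes(canary: WindowStats, primary: WindowStats,
                  max_error_delta: float = 0.005,
                  max_latency_ratio: float = 1.2) -> bool:
    """Allow promotion only if the canary is no worse than the primary within tolerance."""
    if canary.error_rate > primary.error_rate + max_error_delta:
        return False  # canary degrades correctness: roll back
    if canary.p99_latency_ms > primary.p99_latency_ms * max_latency_ratio:
        return False  # canary degrades tail latency: roll back
    return True
```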
Requirements
4+ years in platform, site reliability, or MLOps engineering
Kubernetes — you diagnose pod failures, resource exhaustion, and networking issues yourself
MLflow for model registry, versioning, and serving observability
AWS: CloudWatch, EKS, S3, ECR — operational depth, not just familiarity
Python for building monitoring integrations, alerting logic, and reliability tooling (see the sketch after this list)
Experience designing canary deployments and automated rollback for ML workloads
Strong incident analysis instincts — you form hypotheses, test them, and document what you found
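As an example of the Python-plus-CloudWatch work in practice, here is a minimal sketch (not our production code) of a serving-side check that computes a null-prediction rate over a recent sample and publishes it as a custom CloudWatch metric via boto3, so an alarm can fire on it. The namespace, metric name, and dimensions are hypothetical.

```python
# Illustrative only: compute a null-prediction rate and publish it as a
# custom CloudWatch metric. Namespace, metric name, and dimensions are
# hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_null_rate(predictions: list, model_name: str) -> float:
    """Emit the fraction of null predictions so a CloudWatch alarm can fire on it."""
    null_rate = (
        sum(p is None for p in predictions) / len(predictions) if predictions else 0.0
    )
    cloudwatch.put_metric_data(
        Namespace="MLServing",  # hypothetical namespace
        MetricData=[{
            "MetricName": "PredictionNullRate",
            "Dimensions": [{"Name": "ModelName", "Value": model_name}],
            "Value": null_rate,
            "Unit": "None",
        }],
    )
    return null_rate
```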
Benefits
Direct ownership of a reliability engineering function that genuinely needs to get better
Full remote, US time zones
$92,000 – $115,000 base salary + equity
$1,500 annual tooling and conference budget
Post-incident reviews are blame-free and taken seriously — we mean that