ML Reliability Engineer

Aaron Fischer

Full-time · Mid-level · Los Angeles

About the role

You know that 3am incident where the model silently started returning null for 8% of requests and nobody noticed for six hours because the dashboard showed 'healthy' the whole time? We've had that incident. We've also had the one where a training job finished successfully but wrote its outputs to the wrong S3 path. And the one where a shadow deployment got promoted to primary because someone clicked the wrong button in a pipeline that had too many buttons. We've learned from all of them, eventually.

We now want to hire someone whose job is to make sure we stop learning these lessons the hard way. The title is ML Reliability Engineer. The work is part MLOps, part SRE, part detective. You'll build better observability into our model serving infrastructure on AWS, improve our canary deployment process so bad models don't survive their first five minutes, and design alerting that catches real problems rather than generating noise.

The stack is Python, Kubernetes, MLflow, and AWS. If you're the kind of engineer who reads incident reports from other companies for fun and comes away with a list of things to implement, we will get along well.

Responsibilities

  • Build end-to-end observability for model serving: latency, prediction distribution, null rate, and resource metrics
  • Design and implement automated canary evaluation to catch degraded models before full promotion
  • Reduce mean time to detection and mean time to resolution for ML infrastructure incidents
  • Conduct post-incident reviews and turn findings into implemented improvements, not just action items
  • Document the reliability architecture and maintain up-to-date runbooks

Requirements

  • 4+ years in platform, site reliability, or MLOps engineering
  • Kubernetes — you diagnose pod failures, resource exhaustion, and networking issues yourself
  • MLflow for model registry, versioning, and serving observability
  • AWS: CloudWatch, EKS, S3, ECR — operational depth, not just familiarity
  • Python for building monitoring integrations, alerting logic, and reliability tooling
  • Experience designing canary deployments and automated rollback for ML workloads
  • Strong incident analysis instincts — you form hypotheses, test them, and document what you found

Benefits

  • Direct ownership of a reliability engineering function that genuinely needs to get better
  • Full remote, US time zones
  • $92,000 – $115,000 base salary + equity
  • $1,500 annual tooling and conference budget
  • Post-incident reviews are blame-free and taken seriously — we mean that

Language

English

AI Expertise

MLOps & AI Infrastructure
