Senior MLOps Engineer

Marco Russo

Full-time · Senior · Los Angeles

About the role

We serve 180 million product recommendations per day across a retail marketplace. The ML systems behind those recommendations (ranking, personalisation, complementary-product, and price-elasticity models) run continuously. They are retrained on fresh data every six hours. They are deployed in blue/green fashion to a Kubernetes cluster handling 40,000 requests per second at peak. They are monitored by dashboards the on-call rotation checks before breakfast. When one of them breaks, it is not a startup inconvenience; it is a revenue impact we can quantify to the dollar within 15 minutes.

We are looking for a Senior MLOps Engineer to join the six-person ML platform team. You will not be building recommendation models. You will be building and maintaining the infrastructure that trains, deploys, monitors, and recovers them. You need to have operated ML systems at real scale: not 'we had a lot of requests once' scale, but 'we thought carefully about whether to use blue/green or canary for this model because the wrong choice costs money' scale.

Responsibilities

Own the training pipeline infrastructure for 14 production recommendation and ranking models
Lead the migration of model deployment from manual blue/green to fully automated canary with integrated performance-based rollback
Build and improve real-time model performance monitoring with sub-5-minute alerting on prediction distribution shifts and latency degradation
Manage and optimise Spark-based feature engineering pipelines for data freshness, cost efficiency, and reliability
Mentor two mid-level platform engineers and lead weekly platform health review sessions

Requirements

6+ years in platform, MLOps, or site reliability engineering with direct experience at high-traffic production scale
Kubernetes — you design production cluster configurations, resource policies, and autoscaling strategies, not just deploy sample apps
MLflow for model registry, versioning, and serving in a multi-model production environment
AWS at production depth: EKS, Kinesis, S3, CloudWatch, Lambda
Apache Spark or PySpark for large-scale feature engineering and training data pipeline management
Python for infrastructure tooling, monitoring automation, and deployment scripting
Experience designing A/B testing infrastructure and canary deployment strategies specifically for ML model updates

Benefits

ML infrastructure ownership at genuine production scale — the problems are real, the impact is measurable, and the team takes reliability seriously
Full remote, US time zones
$130,000 – $158,000 base salary + equity
On-call: two weeks per quarter with on-call differential — scheduled, not continuous
Strong team — engineers who have seen real scale and are not interested in shortcuts

Job Type

Full-time

Level

Senior

Language

English

Salary Range

$130,000 – $158,000

AI Expertise

MLOps & AI Infrastructure

