Senior MLOps Engineer

Marco Russo

Full-time · Senior · Los Angeles

About the role

We serve 180 million product recommendations per day across a retail marketplace. The ML systems behind those recommendations (ranking, personalisation, complementary product, and price elasticity models) run continuously. They are retrained on fresh data every six hours. They are deployed in blue/green fashion to a Kubernetes cluster handling 40,000 requests per second at peak. They are monitored by dashboards the on-call rotation checks before breakfast. When one of them breaks, it is not a startup inconvenience; it is a revenue impact we can quantify to the dollar within 15 minutes.

We are looking for a Senior MLOps Engineer to join the six-person ML platform team. You will not be building recommendation models. You will be building and maintaining the infrastructure that trains, deploys, monitors, and recovers them. You need to have operated ML systems at real scale: not 'we had a lot of requests once' scale, but 'we thought carefully about whether to use blue/green or canary for this model because the wrong choice costs money' scale.

Responsibilities

  • Own the training pipeline infrastructure for 14 production recommendation and ranking models
  • Lead the migration of model deployment from manual blue/green to fully automated canary with integrated performance-based rollback
  • Build and improve real-time model performance monitoring with sub-5-minute alerting on prediction distribution shifts and latency degradation
  • Manage and optimise Spark-based feature engineering pipelines for data freshness, cost efficiency, and reliability
  • Mentor two mid-level platform engineers and lead weekly platform health review sessions

Requirements

  • 6+ years in platform, MLOps, or site reliability engineering with direct experience at high-traffic production scale
  • Kubernetes — you design production cluster configurations, resource policies, and autoscaling strategies, not just deploy sample apps
  • MLflow for model registry, versioning, and serving in a multi-model production environment
  • AWS at production depth: EKS, Kinesis, S3, CloudWatch, Lambda
  • Apache Spark or PySpark for large-scale feature engineering and training data pipeline management
  • Python for infrastructure tooling, monitoring automation, and deployment scripting
  • Experience designing A/B testing infrastructure and canary deployment strategies specifically for ML model updates

Benefits

  • ML infrastructure ownership at genuine production scale — the problems are real, the impact is measurable, and the team takes reliability seriously
  • Full remote, US time zones
  • $130,000 – $158,000 base salary + equity
  • On-call: two weeks per quarter with on-call differential — scheduled, not continuous
  • Strong team — engineers who have seen real scale and are not interested in shortcuts

Job Type

Full-time

Level

Senior

Language

English

Salary Range

$130,000 – $158,000

AI Expertise

MLOps & AI Infrastructure
