We operate a real-time recommendation system serving 800 million requests per day across 40+ models running in parallel on AWS, with some workloads on GCP. The infrastructure works, but it is not yet mature enough. Our model deployment pipeline has too many manual steps, our rollback process takes 45 minutes when it should take five, and our experiment tracking is inconsistent enough that engineers sometimes can't reproduce a training run from three months ago. We are hiring a senior MLOps engineer to fix this: not to manage it, but to engineer better solutions. You will be hands-on in the code every day. You will also set engineering standards and review the work of three junior and mid-level engineers. If you have operated ML infrastructure at this scale and you care about the craft of building systems that work reliably when it matters most, we want to talk.
Responsibilities
Lead the redesign of our model deployment and rollback pipeline
Standardise experiment tracking across all ML teams so that past training runs can be reproduced
Mentor and review work for three junior and mid-level MLOps engineers
Define and enforce MLOps engineering standards and best practices
Drive the migration of manual deployment steps into automated, auditable processes
Requirements
6+ years in infrastructure or platform engineering, including 3+ years focused on ML systems
Deep expertise in Kubernetes for model serving and scaling
Airflow pipeline design and operational experience at scale
MLflow or a comparable model registry — experiment tracking, versioning, deployment
AWS-native infrastructure (EKS, SageMaker, S3, CloudWatch) is essential; GCP is a bonus
Databricks for large-scale feature engineering and training jobs
Proven experience designing CI/CD for ML: canary deployments, shadow mode, automated rollback
Benefits
Technical leadership at real scale — 800M requests/day is not a slide deck number