I want to be honest with you about what we are and what we're not. We are a 400-person company. We have a mature product, a real revenue base, and a data science team that has been shipping models for four years. We are not a startup. We do not move fast. We have compliance requirements, change management processes, and an infosec review that adds two weeks to any third-party integration. If you want to feel like a hero who single-handedly transforms a company in ninety days, this is probably not the right place for you.
What we do have is a genuinely hard infrastructure problem that has been partially solved, badly, by three different teams over three years, and a leadership team that is finally serious about fixing it properly. We want someone experienced enough to diagnose what's actually wrong, not just what looks wrong, and patient enough to fix it in an organisation that moves at enterprise speed.
If you've spent time inside a company like ours and understand why change is hard before you understand why it's necessary, please apply. That context is the most valuable thing you could bring.
Responsibilities
Produce a written architecture review of the current ML platform: gaps, risks, and a prioritised remediation plan
Lead the migration of model serving from our current ad-hoc ECS deployment to a standardised Kubernetes-based platform
Build automated retraining and deployment pipelines for our eight highest-priority production models
Define observability standards for model serving: latency SLOs, prediction monitoring, and alerting
Work across data science, platform, and infosec teams to drive technical decisions through proper approval processes
Requirements
6+ years in data engineering, platform engineering, or MLOps, including at least three years at a company of 200+ employees
Kubernetes at production depth — cluster administration, resource governance, multi-tenant namespace design
MLflow or an equivalent model registry and experiment tracking system in a real production setting
AWS including EKS, S3, IAM boundary policies, and VPC design — you've navigated enterprise AWS, not just sandbox accounts
Terraform with cross-team module sharing and state management at scale
Experience working within change management and infosec review processes without circumventing them
Benefits
Stable, funded company with real engineering problems — no runway anxiety
Full remote with quarterly in-person team meetings (travel covered)