We run 20 ML models in production. Some of them are fine. Some of them catch fire in ways that are not always predictable. Our current infrastructure is a collection of decisions that were each correct at the time, assembled over three years without an overall architecture in mind. It works. It's also getting slower to deploy to, harder to explain to a new hire in under an hour, and increasingly dependent on institutional knowledge that only two people hold. We are looking for an AI Infrastructure Engineer to join our four-person platform team and do three things: understand what we have, make it less fragile, and help us build what comes next. The stack is Python, AWS, Kubernetes on EKS, Terraform, MLflow, and GitHub Actions. The work is engineering work — infrastructure code, debugging, on-call rotations, documentation. Not research. Not strategy decks. If you get genuine satisfaction from watching a deployment time drop from 18 minutes to 4 minutes and then writing it up so someone else can repeat it, apply.
Responsibilities
Audit current ML infrastructure and produce a written map of what exists, who owns it, and what needs attention in priority order
Reduce model deployment time by at least 50% in the first quarter through automation and pipeline improvements
Implement infrastructure as code for all components currently managed manually
Build observability for model serving: latency, error rate, prediction distribution, and resource utilisation
Participate in on-call rotation (two weeks per quarter) and conduct written post-incident reviews
Requirements
3–5 years in platform, DevOps, or ML infrastructure engineering
Kubernetes — you read pod specs, debug CrashLoopBackOffs, and configure resource limits without consulting the docs for syntax
AWS at operational depth: EKS, ECR, S3, IAM, CloudWatch — not certification memorisation
Terraform for infrastructure as code — you've written modules, managed state, and resolved state drift
MLflow for model registry and serving in a production environment
Python for scripting, automation, and infrastructure tooling
GitHub Actions or equivalent CI/CD system in production
Benefits
Real infrastructure ownership — you fix things that are actually broken, not hypothetically broken
Full remote, US time zones preferred
$95,000 – $120,000 base salary + equity
$1,200 annual tooling budget
On-call is two weeks per quarter — not continuous, and taken seriously as a cost by the team