Our researchers are brilliant. Our infrastructure is not. Training runs die halfway through, experiment results get lost, and deployments happen over Slack messages. We need an AI infrastructure engineer to fix this — not as a support function, but as a core part of how we do research. If you enjoy bringing order to technical chaos and want to work alongside serious ML scientists, this is the right place.
Responsibilities
Build and maintain GPU training infrastructure on AWS
Set up and improve experiment tracking with MLflow
Design model versioning and registry workflows
Build CI/CD pipelines for model deployment
Monitor system health and resolve infrastructure incidents
Requirements
Experience building ML infrastructure (training pipelines, model registries, serving)
Strong Python skills and familiarity with Docker and Kubernetes
Hands-on experience with at least one ML experiment tracking tool (MLflow, W&B, Neptune)
Cloud infrastructure experience (AWS, GCP, or Azure)
Ability to work closely with research scientists and translate their needs into systems