We run over 200 training jobs a week across a GPU cluster on GKE. Utilisation sits around 40%. Jobs fail silently. Experiment results live in one engineer's personal MLflow instance. Our researchers spend more time fighting infrastructure than doing science. We need an MLOps engineer who has managed GPU workloads on Kubernetes in a real multi-team research environment — not a single training run, but sustained production load at scale.
Responsibilities
Manage and scale our GPU Kubernetes cluster on GKE
Build and maintain ML training pipelines with Kubeflow
Optimise GPU utilisation and reduce idle compute costs
Set up monitoring and alerting for training jobs
Document infrastructure and onboard new ML team members
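To give candidates a concrete feel for the utilisation problem above, here is a toy sketch of the GPU-hour accounting behind a figure like "40% utilisation". It is illustrative only — the `Job` record, the cluster size, and the numbers are hypothetical, not our actual tooling or data:

```python
from dataclasses import dataclass

@dataclass
class Job:
    gpus: int      # GPUs the job held
    hours: float   # wall-clock hours it ran

def utilisation(jobs: list[Job], cluster_gpus: int, window_hours: float) -> float:
    """Fraction of available GPU-hours actually consumed by jobs in the window."""
    used = sum(j.gpus * j.hours for j in jobs)
    return used / (cluster_gpus * window_hours)

# Hypothetical 10-GPU cluster over a 24h window: 240 GPU-hours available,
# but the scheduled jobs only consume 96 of them.
jobs = [Job(gpus=8, hours=6), Job(gpus=4, hours=12)]
print(f"{utilisation(jobs, cluster_gpus=10, window_hours=24):.0%}")  # prints "40%"
```

The real work is closing that gap: bin-packing jobs onto nodes, surfacing idle allocations, and autoscaling the node pool down when the queue drains.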