Our ML team is scaling fast, and our infra isn't keeping up. We've got 12 researchers queuing jobs on a cluster that's half on-prem, half GCP, all held together with bash scripts and good intentions.
Last quarter we had three separate incidents where training runs got silently killed overnight and nobody noticed until the next morning. That can't keep happening.
We need someone to take ownership of this whole layer — proper job scheduling, cost visibility (our cloud bill is a black box right now), fault recovery, and a sensible process for onboarding new model types.
You'll manage two infra engineers who know the current setup well. They're good — they just need leadership and a real plan.
This is a senior technical role with management responsibility. If you want to stay purely hands-on, this probably isn't the right fit.
Requirements
– Experience running GPU clusters at scale (SLURM-, Ray-, or Kubernetes-based)