Our ML team is scaling fast, and our infra isn't keeping up. We've got 12 researchers queuing jobs on a cluster that's half on-prem, half GCP, all held together with bash scripts and good intentions.
Last quarter we had three separate incidents where training runs got silently killed overnight and nobody noticed until the next morning. That can't keep happening.
We need someone to take ownership of this whole layer — proper job scheduling, cost visibility (our cloud bill is a black box right now), fault recovery, and a sensible process for onboarding new model types.
You'll manage two infra engineers who know the current setup well. They're good — they just need leadership and a real plan.
This is a senior technical role with management responsibility. If you want to stay purely hands-on, this probably isn't the right fit.
Requirements
– Experience running GPU clusters at scale (SLURM-, Ray-, or Kubernetes-based)