We run over 200 training jobs a week across a GPU cluster on GKE. Utilisation sits around 40%. Jobs fail silently. Experiment results live in one engineer's personal MLflow instance. Our researchers spend more time fighting infrastructure than doing science. We need an MLOps engineer who has managed GPU workloads on Kubernetes in a real multi-team research environment — not a single training run, but sustained production load at scale.
Responsibilities
Manage and scale our GPU Kubernetes cluster on GKE
Build and maintain ML training pipelines with Kubeflow
Optimise GPU utilisation and reduce idle compute costs
Set up monitoring and alerting for training jobs
Document infrastructure and onboard new ML team members
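To give candidates a concrete feel for the utilisation problem above, here is a toy sketch of the GPU-hour accounting behind a figure like "40% utilisation". It is illustrative only — the `Job` record, the cluster size, and the numbers are hypothetical, not our actual tooling or data:

```python
from dataclasses import dataclass

@dataclass
class Job:
    gpus: int      # GPUs the job held
    hours: float   # wall-clock hours it ran

def utilisation(jobs: list[Job], cluster_gpus: int, window_hours: float) -> float:
    """Fraction of available GPU-hours actually consumed by jobs in the window."""
    used = sum(j.gpus * j.hours for j in jobs)
    return used / (cluster_gpus * window_hours)

# Hypothetical 10-GPU cluster over a 24h window: 240 GPU-hours available,
# but the scheduled jobs only consume 96 of them.
jobs = [Job(gpus=8, hours=6), Job(gpus=4, hours=12)]
print(f"{utilisation(jobs, cluster_gpus=10, window_hours=24):.0%}")  # prints "40%"
```

The real work is closing that gap: bin-packing jobs onto nodes, surfacing idle allocations, and autoscaling the node pool down when the queue drains.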