
MLOps Engineer – Kubernetes & GPU Clusters

Amara Diallo

Full-time · Mid-level · Los Angeles

About the role

We run over 200 training jobs a week across a GPU cluster on GKE. Utilisation sits around 40%. Jobs fail silently. Experiment results live in one engineer's personal MLflow instance. Our researchers spend more time fighting infrastructure than doing science. We need an MLOps engineer who has managed GPU workloads on Kubernetes in a real multi-team research environment — not a single training run, but sustained production load at scale.
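The ~40% utilisation figure quoted above translates into a lot of idle compute. A back-of-envelope sketch, using a hypothetical cluster size (only the utilisation figure comes from the role description):

```python
# Back-of-envelope: idle GPU-hours per week at ~40% utilisation.
# num_gpus is a hypothetical example value, not the actual cluster size.
num_gpus = 64
hours_per_week = 24 * 7
utilisation = 0.40  # figure quoted in the role description

total_gpu_hours = num_gpus * hours_per_week
idle_gpu_hours = total_gpu_hours * (1 - utilisation)
print(f"{idle_gpu_hours:.0f} idle GPU-hours per week")
```

Even at this modest hypothetical size, thousands of GPU-hours a week sit idle, which is why utilisation work is a core part of the role.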

Responsibilities

  • Manage and scale our GPU Kubernetes cluster on GKE
  • Build and maintain ML training pipelines with Kubeflow
  • Optimise GPU utilisation and reduce idle compute costs
  • Set up monitoring and alerting for training jobs
  • Document infrastructure and onboard new ML team members

Requirements

  • 3+ years MLOps with Kubernetes in production
  • Experience managing GPU workloads (NVIDIA, CUDA, node selectors)
  • Strong Python and familiarity with Helm and Kustomize
  • Hands-on with Airflow or Kubeflow for pipeline orchestration
  • Cloud experience (GCP preferred — GKE, Vertex AI, Cloud Storage)
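To make the "GPU workloads, node selectors" requirement concrete, here is a minimal sketch of the kind of manifest this work involves: a pod requesting one NVIDIA GPU, pinned to a GKE accelerator node pool. It is expressed as a plain Python dict (ready to serialize to YAML/JSON); the pod name and image are hypothetical placeholders.

```python
# Minimal sketch of a GPU pod manifest for GKE, as a Python dict.
# The NVIDIA device plugin exposes GPUs as the extended resource
# "nvidia.com/gpu"; GKE labels GPU node pools with
# "cloud.google.com/gke-accelerator".
gpu_training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job-example"},  # hypothetical name
    "spec": {
        # Pin the pod to nodes with the desired accelerator type.
        "nodeSelector": {
            "cloud.google.com/gke-accelerator": "nvidia-tesla-t4"
        },
        "containers": [
            {
                "name": "trainer",
                "image": "us-docker.pkg.dev/example/train:latest",  # hypothetical image
                # GPUs are requested via resource limits.
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
        "restartPolicy": "Never",
    },
}
```

Day to day, the job is managing hundreds of specs like this one: right-sizing GPU requests, matching workloads to node pools, and making sure failed pods surface loudly rather than silently.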

Benefits

  • Greenfield MLOps build at a research-heavy company
  • Fully remote
  • Competitive salary + equity
  • GPU budget for personal experiments
  • 25 days PTO

Job Type

Full-time

Level

Mid-level

Language

English

Salary Range

$120,000 – $155,000

AI Expertise

MLOps & AI Infrastructure
