Our ML training and inference costs went from $40k/month to $180k/month in 8 months. Nobody knows exactly why.
We need an ML infra engineer to audit our GPU usage — what's running, what's idle, what's overprovisioned — and implement a cost reduction plan.
Current setup: training runs on AWS (SageMaker) and serving runs on GCP (GKE). Training jobs run for weeks, and we have no visibility into whether they are actually using their GPUs efficiently.
Expected output: usage audit report, 3-month savings roadmap, and implementation of the top 5 quick wins.
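
To give a rough idea of the first-pass check we have in mind for the usage audit, here is a minimal Python sketch. It is an illustration, not a spec. It assumes per-job GPU utilization samples (0-100 percent) have already been exported from whatever monitoring is available. The thresholds, function names, and job names are all hypothetical, and the sample data is made up.

```python
from statistics import mean

# Hypothetical thresholds (percent GPU utilization); tune per workload.
IDLE_BELOW = 5.0
UNDERUSED_BELOW = 40.0


def classify_job(utilization_samples):
    """Label a job from its GPU utilization samples (0-100 percent)."""
    if not utilization_samples:
        return "no-data"
    avg = mean(utilization_samples)
    if avg < IDLE_BELOW:
        return "idle"
    if avg < UNDERUSED_BELOW:
        return "underused"
    return "ok"


def audit(jobs):
    """Map each job name to its label. `jobs` is {name: [samples]}."""
    return {name: classify_job(samples) for name, samples in jobs.items()}


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    example = {
        "train-a": [92.0, 88.5, 95.0],
        "train-b": [12.0, 30.0, 25.0],
        "serve-c": [0.0, 1.0, 0.5],
    }
    for name, label in audit(example).items():
        print(f"{name}: {label}")
```

Average utilization alone is a coarse signal. A real audit would likely also look at GPU memory use, instance sizing, and cost per job before calling anything overprovisioned.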