We serve 10M+ ML predictions per day and our on-call rotation is a disaster. Right now, when something goes wrong at 2am, the on-call engineer has no runbooks, no clear SLOs, and ends up pinging Slack until they find someone who knows what the thing does.
We need an SRE who has worked on ML inference systems and knows the specific failure modes — model staleness, latency spikes from queue buildup, GPU memory pressure, etc.
The first 90 days would be: defining SLOs that actually mean something, getting alerting out of 'alert on everything' mode, and writing runbooks so the on-call engineer isn't flying blind.
This is less about keeping servers alive and more about building the processes and tooling that let us sleep at night.