We serve 10M+ ML predictions per day and our on-call rotation is a disaster. Right now, when something goes wrong at 2am, the on-call engineer has no runbooks, no clear SLOs, and ends up pinging Slack until they find someone who knows what the thing does.
We need an SRE who has worked on ML inference systems and knows the specific failure modes — model staleness, latency spikes from queue buildup, GPU memory pressure, etc.
The first 90 days would be: defining SLOs that actually mean something, getting alerting out of 'alert on everything' mode, and writing runbooks so the on-call engineer isn't flying blind.
This is less about keeping servers alive and more about building the processes and tooling that let us sleep at night.