We ship AI features every two weeks. Sometimes they're better than the previous version. Sometimes they're not. We don't always know which until a user complains. That's not good enough. We need someone to build the evaluation layer that sits between "we think this prompt change is better" and "we ship it to users." If you've built LLM eval pipelines before — automated suites, regression dashboards, human-in-the-loop workflows — and you care about output quality as much as shipping speed, this role is for you.
Responsibilities
Design and maintain an LLM evaluation suite spanning automated checks, regression tests, and human-in-the-loop review
Run evals before every prompt or model update ships to production (see the sketch below)
Build dashboards to track eval metrics over time
Identify and document failure modes with reproducible examples
Work with engineers and PMs to define quality standards per feature
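
To give a flavor of the work, here is a minimal sketch of the kind of automated eval gate the suite would grow from. Everything in it is illustrative, not our stack: the generate stub, the substring check, and the 95% threshold are placeholders you would replace with real scorers and per-feature standards.

```python
"""Minimal sketch of an automated eval gate (illustrative only).

Assumes a generate(prompt) callable that invokes the model or prompt under
test, plus a small set of hand-labeled cases. A real suite would add
regression baselines, per-feature thresholds, and a human review queue.
"""

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # simplest possible pass/fail criterion


def run_eval(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Score the candidate against labeled cases and return the pass rate."""
    passed = 0
    for case in cases:
        output = generate(case.prompt)
        if case.must_contain.lower() in output.lower():
            passed += 1
    return passed / len(cases)


if __name__ == "__main__":
    # Hypothetical stand-in for the model call under test.
    def generate(prompt: str) -> str:
        return "Refunds are processed within 5 business days."

    cases = [
        EvalCase("How long do refunds take?", must_contain="5 business days"),
        EvalCase("Can I get a refund?", must_contain="refund"),
    ]

    pass_rate = run_eval(generate, cases)
    print(f"pass rate: {pass_rate:.0%}")

    # Gate the release: block the ship if quality drops below the bar.
    THRESHOLD = 0.95  # illustrative; in practice set per feature with PMs
    if pass_rate < THRESHOLD:
        raise SystemExit("Eval gate failed: do not ship this change.")
```

In practice you'd swap the substring check for rubric- or model-graded scoring and wire the gate into the release process; deciding where that bar sits for each feature is exactly the kind of call this role owns.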