We ship AI features every two weeks. Sometimes they're better than the previous version. Sometimes they're not. We don't always know which until a user complains. That's not good enough. We need someone to build the evaluation layer that sits between "we think this prompt change is better" and "we ship it to users." If you've built LLM eval pipelines before — automated suites, regression dashboards, human-in-the-loop workflows — and you care about output quality as much as shipping speed, this role is for you.
Responsibilities
Design and maintain an LLM evaluation suite spanning automated checks, regression tests, and human-in-the-loop review
Run evals before every prompt or model update ships to production (see the sketch below)
Build dashboards to track eval metrics over time
Identify and document failure modes with reproducible examples
Work with engineers and PMs to define quality standards per feature
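
To give a flavor of the work, here is a minimal sketch of the kind of automated eval gate the suite would grow from. Everything in it is illustrative, not our stack: the generate stub, the substring check, and the 95% threshold are placeholders you would replace with real scorers and per-feature standards.

```python
"""Minimal sketch of an automated eval gate (illustrative only).

Assumes a generate(prompt) callable that invokes the model or prompt under
test, plus a small set of hand-labeled cases. A real suite would add
regression baselines, per-feature thresholds, and a human review queue.
"""

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # simplest possible pass/fail criterion


def run_eval(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Score the candidate against labeled cases and return the pass rate."""
    passed = 0
    for case in cases:
        output = generate(case.prompt)
        if case.must_contain.lower() in output.lower():
            passed += 1
    return passed / len(cases)


if __name__ == "__main__":
    # Hypothetical stand-in for the model call under test.
    def generate(prompt: str) -> str:
        return "Refunds are processed within 5 business days."

    cases = [
        EvalCase("How long do refunds take?", must_contain="5 business days"),
        EvalCase("Can I get a refund?", must_contain="refund"),
    ]

    pass_rate = run_eval(generate, cases)
    print(f"pass rate: {pass_rate:.0%}")

    # Gate the release: block the ship if quality drops below the bar.
    THRESHOLD = 0.95  # illustrative; in practice set per feature with PMs
    if pass_rate < THRESHOLD:
        raise SystemExit("Eval gate failed: do not ship this change.")
```

In practice you'd swap the substring check for rubric- or model-graded scoring and wire the gate into the release process; deciding where that bar sits for each feature is exactly the kind of call this role owns.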