LLM Evaluation Specialist

Nadia Kowalski

Full-time · Mid-level · New York

About the role

We ship AI features every two weeks. Sometimes they're better than the previous version. Sometimes they're not. We don't always know which until a user complains. That's not good enough. We need someone to build the evaluation layer that sits between "we think this prompt change is better" and "we ship it to users." If you've built LLM eval pipelines before — automated suites, regression dashboards, human-in-the-loop workflows — and you care about output quality as much as shipping speed, this role is for you.
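For a flavor of what that gate might look like, here's a minimal sketch in plain Python. The test cases, the substring pass criterion, and the 5% regression threshold are all illustrative assumptions, not our actual stack:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class TestCase:
        input_text: str
        must_contain: str  # simplest possible pass criterion

    # Hypothetical fixed test set; a real suite would hold hundreds of cases.
    TEST_CASES = [
        TestCase("Summarize: The meeting moved to Friday.", "Friday"),
        TestCase("Summarize: Revenue grew 12% year over year.", "12%"),
    ]

    def pass_rate(run_prompt: Callable[[str], str]) -> float:
        """Fraction of test cases whose output meets its criterion."""
        passed = sum(
            case.must_contain in run_prompt(case.input_text)
            for case in TEST_CASES
        )
        return passed / len(TEST_CASES)

    def should_ship(baseline: Callable[[str], str],
                    candidate: Callable[[str], str],
                    max_regression: float = 0.05) -> bool:
        """Gate a release: ship only if the candidate prompt does not
        regress more than max_regression below the current baseline."""
        return pass_rate(candidate) >= pass_rate(baseline) - max_regression

In practice the pass criterion would be richer (rubrics, model-graded checks, human review), but the shape is the same: a fixed test set, a baseline, and a threshold that decides whether a change ships.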

Responsibilities

  • Design and maintain a comprehensive LLM evaluation suite
  • Run evals before every prompt or model update ships to production
  • Build dashboards to track eval metrics over time
  • Identify and document failure modes with reproducible examples (see the sketch after this list)
  • Work with engineers and PMs to define quality standards per feature
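
To make "reproducible examples" concrete, here's one hedged sketch of such a record. Every field name and the JSONL path are assumptions, not a prescribed schema:

    import json
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    @dataclass
    class FailureCase:
        feature: str         # which product feature the eval covers
        prompt_version: str  # exact prompt/model version that failed
        input_text: str      # minimal input that reproduces the failure
        expected: str        # what a passing output must satisfy
        actual: str          # the observed bad output
        recorded_at: str     # ISO timestamp, so dashboards can trend failures

    def record_failure(case: FailureCase, path: str = "failures.jsonl") -> None:
        """Append one reproducible failure record for triage and
        future regression tests."""
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(case)) + "\n")

    record_failure(FailureCase(
        feature="summarizer",
        prompt_version="prompt-v42",
        input_text="Summarize: The meeting moved from Monday to Friday.",
        expected="output mentions the new day (Friday)",
        actual="output says the meeting is on Monday",
        recorded_at=datetime.now(timezone.utc).isoformat(),
    ))

Records like this feed directly back into the eval suite: today's documented failure becomes tomorrow's regression test.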

Requirements

  • Experience designing LLM evaluation frameworks (automated + human-in-the-loop)
  • Familiarity with evaluation tools (RAGAS, DeepEval, PromptFoo, or custom)
  • Strong Python for building eval pipelines
  • Sharp analytical eye for LLM failure modes
  • Able to communicate eval findings clearly to product and engineering teams

Benefits

  • High-leverage role in a fast-shipping AI team
  • Fully remote
  • Learning budget
  • Modern AI stack
  • Flexible schedule

Job Type

Full-time

Level

Mid-level

Language

English

Salary Range

$90,000 – $120,000

AI Expertise

NLP & Prompt Engineering
