We build a consumer AI assistant used by 1.4 million people daily. The assistant helps users draft text, navigate complex decisions, and get reliable answers across financial planning, health information, and professional communication. At that scale, every change to a system prompt has measurable effects within hours. We learned this the hard way: a prompt that tested well on our internal eval suite degraded user satisfaction scores by 3.2 points in its first week in production.

That experience is the reason we're hiring a Senior Prompt Engineer whose primary responsibility isn't writing prompts: it's building the evaluation frameworks that tell us whether a prompt is actually better, not just different. You'll work at the intersection of language design, RAG architecture, and rigorous measurement. You'll own the evaluation methodology. You'll run the A/B tests. You'll document every change with a rationale and test results before it ships. The role reports to our Head of AI Product and requires overlap with US Central hours.
Responsibilities
Own and continuously improve all production system prompts across the assistant's core feature set
Design and maintain an automated prompt evaluation framework with clear, reproducible metrics
Run controlled A/B tests on prompt variants and report findings with statistical confidence intervals
Collaborate with the safety team to identify and remediate failure modes before deployment
Document prompt architecture decisions in a format accessible to engineers, product managers, and leadership
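To make the A/B-testing responsibility concrete: a minimal sketch of the kind of interval reporting we expect, using a Wilson score interval on per-variant thumbs-up rates. All names and counts here are hypothetical illustrations, not our actual metrics or tooling.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (e.g. thumbs-up rate)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

# Hypothetical A/B result: thumbs-up counts for control vs. candidate prompt.
control = wilson_interval(8_410, 10_000)
candidate = wilson_interval(8_655, 10_000)
print(f"control:   {control[0]:.3f}-{control[1]:.3f}")
print(f"candidate: {candidate[0]:.3f}-{candidate[1]:.3f}")
# Non-overlapping intervals are a quick, conservative screen; a formal
# two-proportion test would still accompany any ship/no-ship recommendation.
```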
Requirements
4+ years of engineering or applied research with 2+ years of focused prompt engineering on production LLM systems
Deep hands-on experience with OpenAI and Anthropic APIs — you understand temperature, top-p, and logit bias in production, not just in theory
Experience designing automated evaluation pipelines for LLM output quality: helpfulness, accuracy, and safety metrics
LangChain or LlamaIndex for RAG pipeline construction and continuous improvement
Python for evaluation automation, pipeline tooling, and data analysis
Familiarity with structured output techniques: function calling, JSON mode, and constrained generation
Strong written communication — every prompt change you ship has a documented rationale and evaluation result
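As a flavor of the evaluation work: one of the cheapest automated checks is validating that structured (JSON-mode) responses honor their contract before any quality scoring runs. This is a hypothetical sketch, with an invented `REQUIRED_KEYS` contract, not our production framework.

```python
import json

REQUIRED_KEYS = {"answer", "confidence", "sources"}  # hypothetical response contract

def check_structured_output(raw: str) -> list[str]:
    """Return a list of contract failures for one response expected to be JSON."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(obj, dict):
        return ["top level is not an object"]
    failures = []
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    conf = obj.get("confidence")
    if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
        failures.append("confidence not in [0, 1]")
    return failures

# An eval pipeline would aggregate these failure rates per prompt variant.
```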
Benefits
Work at the intersection of LLM research and consumer product at real scale — you see how every change performs on 1.4 million daily users
Full remote, US time zones
$120,000 – $150,000 base salary + equity
$2,500 annual research and conference budget
Access to production evaluation data — real signal, not synthetic benchmarks