We run a document automation platform for insurance companies. Our core feature — extracting structured data from unstructured policy documents — currently runs at 71% accuracy. The target is 90%+. Previous attempts to improve accuracy by swapping models failed because the problem is prompt architecture, not model capability. We need a prompt engineer who treats prompt design as a systems engineering problem: structured evaluation, regression testing, measurable improvement, documented rationale. You'll work across GPT-4o, Claude 3.5, and two fine-tuned variants. The work is systematic and rigorous — not creative writing. If you enjoy precision and have built prompt evaluation pipelines before, this is the role for you.
Responsibilities
Audit and systematically improve prompts across our document extraction pipeline
Build and maintain a prompt regression test suite with 200+ validated test cases
Design controlled experiments to evaluate prompt changes against a baseline
Document every prompt change with rationale, test results, and known edge cases
Work with the ML team on fine-tuning data curation informed by prompt failure analysis
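To illustrate the kind of harness the regression-testing responsibility implies, here is a minimal sketch. The test-case format, field names, and the stubbed `extract` function are all hypothetical; a production version would call a real LLM API and load validated cases from storage.

```python
# Minimal sketch of a prompt regression suite.
# Each case pairs a document snippet with the structured fields the
# extraction prompt is expected to return. (Illustrative data only.)
TEST_CASES = [
    {
        "id": "policy-001",
        "document": "Policy number PN-4821, effective 2024-01-15, insured: Acme Corp.",
        "expected": {"policy_number": "PN-4821", "effective_date": "2024-01-15"},
    },
]

def extract(document: str) -> dict:
    """Stand-in for the real model call. A production harness would send
    the extraction prompt plus `document` to an LLM API and parse the
    JSON response; here we return a fixed answer so the sketch runs."""
    return {"policy_number": "PN-4821", "effective_date": "2024-01-15"}

def run_suite(cases, extract_fn):
    """Score each case field-by-field and report aggregate accuracy."""
    results = []
    for case in cases:
        got = extract_fn(case["document"])
        # Count expected fields the extraction got exactly right.
        field_hits = sum(1 for k, v in case["expected"].items() if got.get(k) == v)
        results.append({"id": case["id"], "accuracy": field_hits / len(case["expected"])})
    overall = sum(r["accuracy"] for r in results) / len(results)
    return overall, results

overall, per_case = run_suite(TEST_CASES, extract)
print(f"overall field accuracy: {overall:.2%}")
```

Running a candidate prompt change means swapping in a new `extract_fn` and comparing `overall` against the baseline before merging, which is what makes prompt changes regression-testable.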
Requirements
3+ years of engineering with LLM APIs in production environments
Experience building structured prompt evaluation and regression test suites
Deep fluency in few-shot design, chain-of-thought prompting, and structured output extraction
Hands-on experience with the OpenAI API and at least one other major provider
Strong Python for scripting evaluation pipelines and result analysis
Familiarity with LangChain for prompt chaining and output parsing
Experience with Hugging Face for model evaluation metrics is a bonus
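The few-shot and structured-output skills named above can be sketched as a prompt builder. The schema, field names, and example documents below are illustrative assumptions, not part of this platform's actual pipeline.

```python
import json

# Hypothetical target schema for structured extraction from policy text.
SCHEMA = {"policy_number": "string", "effective_date": "YYYY-MM-DD"}

# One few-shot exemplar: an input document paired with its gold JSON answer.
FEW_SHOT = [
    (
        "Policy PN-1001 takes effect 2023-06-01.",
        {"policy_number": "PN-1001", "effective_date": "2023-06-01"},
    ),
]

def build_prompt(document: str) -> str:
    """Assemble a few-shot prompt that asks the model for JSON matching SCHEMA."""
    lines = [
        "Extract the fields below from the policy text.",
        f"Return JSON only, matching this schema: {json.dumps(SCHEMA)}",
    ]
    for doc, answer in FEW_SHOT:
        lines.append(f"Text: {doc}")
        lines.append(f"JSON: {json.dumps(answer)}")
    lines.append(f"Text: {document}")
    lines.append("JSON:")  # cue the model to complete with JSON only
    return "\n".join(lines)

prompt = build_prompt("Policy PN-2002 takes effect 2024-03-15.")
print(prompt)
```

The exemplar constrains both the output format and the field semantics, which is why few-shot design and structured output extraction are listed together as one skill set.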
Benefits
Real precision engineering challenge — not prompt guessing
Full remote, async-first team across US and EU
$85,000 – $108,000 base salary
Direct access to the ML team and model providers
Quarterly tool budget for API access and evaluation tooling