AI Infrastructure Engineer

Amara Diallo

Full-time · Mid-level · New York

About the role

We run 20 ML models in production. Some of them are fine. Some of them catch fire in ways that are not always predictable. Our current infrastructure is a collection of decisions that were each correct at the time, assembled over three years without an overall architecture in mind. It works. It's also getting slower to deploy to, harder to explain to a new hire in under an hour, and increasingly dependent on institutional knowledge that only two people hold.

We are looking for an AI Infrastructure Engineer to join our four-person platform team and do three things: understand what we have, make it less fragile, and help us build what comes next. The stack is Python, AWS, Kubernetes on EKS, Terraform, MLflow, and GitHub Actions.

The work is engineering work: infrastructure code, debugging, on-call rotations, documentation. Not research. Not strategy decks. If you get genuine satisfaction from watching a deployment time drop from 18 minutes to 4 minutes and then writing it up so someone else can repeat it, apply.

Responsibilities

  • Audit current ML infrastructure and produce a written map of what exists, who owns it, and what needs attention in priority order
  • Reduce model deployment time by at least 50% in the first quarter through automation and pipeline improvements
  • Implement infrastructure as code for all components currently managed manually
  • Build observability for model serving: latency, error rate, prediction distribution, and resource utilisation
  • Participate in on-call rotation (two weeks per quarter) and conduct written post-incident reviews

Requirements

  • 3–5 years in platform, DevOps, or ML infrastructure engineering
  • Kubernetes — you read pod specs, debug CrashLoopBackOffs, and configure resource limits without consulting the docs for syntax
  • AWS at operational depth: EKS, ECR, S3, IAM, CloudWatch — not certification memorisation
  • Terraform for infrastructure as code — you've written modules, managed state, and resolved state drift
  • MLflow for model registry and serving in a production environment
  • Python for scripting, automation, and infrastructure tooling
  • GitHub Actions or equivalent CI/CD system in production

Benefits

  • Real infrastructure ownership — you fix things that are actually broken, not hypothetically broken
  • Full remote, US time zones preferred
  • $95,000 – $120,000 base salary + equity
  • $1,200 annual tooling budget
  • On-call is two weeks per quarter — not continuous, and taken seriously as a cost by the team

Job Type

Full-time

Level

Mid-level

Language

English

Salary Range

$95,000 – $120,000

AI Expertise

MLOps & AI Infrastructure
