AI Infrastructure Engineer

Amara Diallo

Full-time · Mid-level · New York

About the role

We run 20 ML models in production. Some of them are fine. Some of them catch fire in ways that are not always predictable. Our current infrastructure is a collection of decisions that were each correct at the time, assembled over three years without an overall architecture in mind. It works. It's also getting slower to deploy to, harder to explain to a new hire in under an hour, and increasingly dependent on institutional knowledge that only two people hold.

We are looking for an AI Infrastructure Engineer to join our four-person platform team and do three things: understand what we have, make it less fragile, and help us build what comes next. The stack is Python, AWS, Kubernetes on EKS, Terraform, MLflow, and GitHub Actions.

The work is engineering work: infrastructure code, debugging, on-call rotations, documentation. Not research. Not strategy decks. If you get genuine satisfaction from watching a deployment time drop from 18 minutes to 4 minutes and then writing it up so someone else can repeat it, apply.

Responsibilities

Audit current ML infrastructure and produce a written map of what exists, who owns it, and what needs attention in priority order
Reduce model deployment time by at least 50% in the first quarter through automation and pipeline improvements
Implement infrastructure as code for all components currently managed manually
Build observability for model serving: latency, error rate, prediction distribution, and resource utilisation
Participate in on-call rotation (two weeks per quarter) and conduct written post-incident reviews

Requirements

3–5 years in platform, DevOps, or ML infrastructure engineering
Kubernetes — you read pod specs, debug CrashLoopBackOffs, and configure resource limits without consulting the docs for syntax
AWS at operational depth: EKS, ECR, S3, IAM, CloudWatch — not certification memorisation
Terraform for infrastructure as code — you've written modules, managed state, and resolved state drift
MLflow for model registry and serving in a production environment
Python for scripting, automation, and infrastructure tooling
GitHub Actions or equivalent CI/CD system in production

Benefits

Real infrastructure ownership — you fix things that are actually broken, not hypothetically broken
Fully remote, US time zones preferred
$95,000 – $120,000 base salary + equity
$1,200 annual tooling budget
On-call is two weeks per quarter — not continuous, and taken seriously as a cost by the team

Language

English

AI Expertise

MLOps & AI Infrastructure
