We are the company behind an open source ML experiment tracking and model registry tool with 9,200 GitHub stars and a contributor community of 340 people across 41 countries. Three years ago we were a side project. Two years ago we incorporated. Last year we launched our managed cloud offering and crossed $2M ARR. Our business model is open core: the library stays free and MIT-licensed forever, and we sell managed hosting and enterprise features to organisations that want to run it at scale.

The engineering culture here is shaped by the open source project. We write in public. We discuss architectural decisions in GitHub issues before we make them. We merge pull requests from contributors we've never met based on code quality and rationale alone. If that sounds uncomfortable, this is probably not the right place. If it sounds like the way software should be built, you'll feel at home immediately.

We're looking for an ML Platform Engineer to work on the infrastructure that powers both the open source project and the managed cloud service. The work spans Kubernetes, CI/CD, observability, and the specific infrastructure challenges of a tool that needs to work identically whether a user runs it on a laptop or at Fortune 500 scale.
Responsibilities
Own and improve the Kubernetes infrastructure powering our managed cloud service on GCP
Design and implement CI/CD pipelines for the open source library and the managed product
Build observability into the platform: metrics, tracing, alerting, and on-call runbooks
Contribute to the open source SDK — your platform knowledge should improve the developer experience of the library itself
Review infrastructure contributions from community members and provide constructive technical feedback
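To give a concrete flavour of the CI/CD work: a release pipeline for the open source library might look something like the sketch below. This is an illustration only — the workflow, repo, and secret names are placeholders, not our actual setup.

```yaml
# Hypothetical release workflow: build and publish the library on a version tag.
name: release
on:
  push:
    tags:
      - "v*"
jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Build sdist and wheel
        run: |
          python -m pip install build
          python -m build
      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.PYPI_API_TOKEN }}
```

In practice you would own the versioning scheme, the tag conventions, and the automation around changelogs and release notes — this fragment only shows the shape of the publish step.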
Requirements
3–5 years of platform, infrastructure, or MLOps engineering experience
Kubernetes — you design multi-tenant workloads, write Helm charts, and debug networking issues without the docs open
Docker — images, multi-stage builds, registry management, and security scanning
GCP — GKE, Cloud Run, Cloud Storage, and IAM — you've run production workloads, not just development environments
CI/CD pipeline design and ownership — GitHub Actions at a minimum, plus experience with release automation and versioning
Python for infrastructure tooling, automation scripts, and SDK contributions
A genuine history of open source participation — contributed code, filed thoughtful issues, or maintained a project
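For the Docker requirement, "multi-stage builds" means the pattern sketched below: build dependencies in one stage, ship a minimal runtime image from another. Names here are illustrative placeholders, not our actual images.

```dockerfile
# Hypothetical multi-stage build for a Python service.

# --- Build stage: wheels are compiled where build tooling is available ---
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# --- Runtime stage: only the built wheels are copied, no compilers ship ---
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY . .
USER 1000
CMD ["python", "-m", "myservice"]
```

The payoff is a smaller attack surface and faster pulls, which is also what the security-scanning part of the requirement is about.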
Benefits
Your work is public — the infrastructure decisions you make are visible to thousands of ML engineers
Full remote, async-first — most of our community is in time zones we'll never share anyway