We process 500 million events per day. The stack is Kafka + Spark on AWS, orchestrated via Airflow. It's held up, but end-to-end latency on our critical event streams has crept from 90 seconds to 6+ minutes over the past year, and our ML team keeps missing its feature freshness SLAs.
What we actually need:
– Scale our Kafka cluster from 12 to 30+ brokers without a weekend maintenance window
– Get P95 stream latency back under 2 minutes
– Build a feature store (we're evaluating Feast vs Tecton) so ML models can query real-time features without going directly to Kafka; a sketch of that read path follows this list
– Improve observability: when latency spikes, we currently have no idea where in the pipeline the bottleneck is. The two probes sketched after this list show the kind of instrumentation we mean.
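
On observability: a cheap first step is making the Spark side report where each micro-batch spends its time, so a spike can be attributed to a stage instead of guessed at. A minimal sketch, assuming Spark 3.4+ (the first release with a Python `StreamingQueryListener`); `emit_metric` and the metric names are placeholders for whatever backend we pick:

```python
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener


def emit_metric(name: str, value: float, tags: dict) -> None:
    # Placeholder sink: wire to CloudWatch/StatsD/Prometheus in production.
    print(f"{name}={value} {tags}")


class StageLatencyListener(StreamingQueryListener):
    """Report how long each micro-batch spent in each internal stage."""

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        p = event.progress
        # durationMs breaks the trigger down into stages such as
        # getBatch, queryPlanning, addBatch, walCommit, ...
        for stage, ms in (p.durationMs or {}).items():
            emit_metric("stream.stage_ms", ms, {"query": p.name, "stage": stage})
        emit_metric("stream.input_rows", p.numInputRows, {"query": p.name})

    def onQueryTerminated(self, event):
        pass


spark = SparkSession.builder.appName("latency-probe").getOrCreate()
spark.streams.addListener(StageLatencyListener())
```

Once every streaming query emits stage timings, per-stage P95 dashboards fall out of the metrics backend for free.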
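The other half is knowing whether delay is piling up upstream, in Kafka consumption itself. A rough consumer-lag probe using confluent-kafka; the broker address, group, and topic names are hypothetical:

```python
from confluent_kafka import Consumer, TopicPartition

GROUP = "ml-feature-pipeline"  # hypothetical consumer group
TOPIC = "critical-events"      # hypothetical topic

# Creating a consumer with the group's id (without subscribing or polling)
# lets us read that group's committed offsets without joining the group.
consumer = Consumer({
    "bootstrap.servers": "broker-1:9092",  # placeholder
    "group.id": GROUP,
    "enable.auto.commit": False,  # we never commit anything
})

meta = consumer.list_topics(TOPIC, timeout=10)
partitions = [TopicPartition(TOPIC, p) for p in meta.topics[TOPIC].partitions]

total_lag = 0
for tp, committed in zip(partitions, consumer.committed(partitions, timeout=10)):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    offset = committed.offset if committed.offset >= 0 else low  # no commit yet
    total_lag += high - offset
    print(f"partition {tp.partition}: lag={high - offset}")

print(f"total lag for {GROUP} on {TOPIC}: {total_lag}")
consumer.close()
```

Run per critical stream on a schedule and alert when lag, converted to seconds at the observed consume rate, threatens the 2-minute target.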
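To make the feature-store bullet concrete: the point is that models fetch fresh features through a store SDK instead of consuming Kafka directly. Roughly what the read path could look like if we land on Feast (the feature view, field, and entity names are invented for illustration; Tecton's serving API plays the same role):

```python
from feast import FeatureStore

# Assumes a configured Feast repo; repo_path points at feature_store.yaml.
store = FeatureStore(repo_path=".")

# Online lookup at inference time: entity key in, fresh feature values out.
features = store.get_online_features(
    features=[
        "user_activity:event_count_5m",   # hypothetical feature_view:field
        "user_activity:distinct_pages_5m",
    ],
    entity_rows=[{"user_id": 1001}],      # hypothetical entity key
).to_dict()

print(features)
```

The pipeline's job is then to keep the online store populated from the streams; the models never touch Kafka.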
The ML team is eight scientists who are vocal about what they need, and you'll work directly with them. They're not always patient; if constant stakeholder requests frustrate you, this will be a hard role. If you like having users who depend on your work every day, it's a great one.
DevOps owns the infra; you own the pipeline. The boundary is usually clear.
Requirements
– Production experience with Kafka and Spark at scale
– Strong Python, ideally PySpark
– AWS experience (MSK, EMR, Glue, or equivalents)
– Experience building real-time feature pipelines for ML