ML Data Pipeline Migration – Hadoop to AWS Spark

Kenji Nakamura

Freelance · Mid-level · Chicago

About the role

We are migrating our entire ML data processing layer from a legacy Hadoop and Hive environment to a Spark-on-AWS architecture. The existing pipeline consists of approximately 180,000 lines of PySpark and SQL across 90 ETL jobs that run nightly to prepare training data for our recommendation and fraud detection models, processing roughly 4 TB per night.

We need an experienced data engineer or MLOps specialist who has executed this type of migration before. You will audit the existing code, design the migration approach, execute it in stages, set up orchestration on Amazon Managed Workflows for Apache Airflow (MWAA), and validate output parity with the legacy system before we decommission anything.

Scope is clearly defined, deliverables are documented, and the budget is approved. If you have migrated legacy Hadoop pipelines before and can show that experience, we want to hear from you.

Key Deliverables

  • Audit all 90 existing ETL jobs and produce a migration priority and risk assessment
  • Execute the migration in defined stages with output parity validation at each stage
  • Set up Airflow orchestration on AWS MWAA to replace the legacy scheduler
  • Document the full post-migration architecture and data lineage
  • Conduct a final handover review with the internal data engineering team
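
To make the output parity deliverable concrete, here is a minimal sketch of an order-independent comparison between a legacy job's output and its migrated counterpart. It is an illustration only, not our actual validation harness: the function names (`row_fingerprint`, `parity_report`) and the sample columns are hypothetical, and at 4 TB per night the same idea would run as a distributed Spark job rather than in plain Python.

```python
import hashlib
from collections import Counter
from typing import Iterable, Mapping, Sequence


def row_fingerprint(row: Mapping[str, object], columns: Sequence[str]) -> str:
    """Hash one row over a fixed column order so dict key order does not matter."""
    payload = "\x1f".join(repr(row.get(col)) for col in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def parity_report(
    legacy_rows: Iterable[Mapping[str, object]],
    migrated_rows: Iterable[Mapping[str, object]],
    columns: Sequence[str],
) -> dict:
    """Compare two outputs as multisets of rows (row order is ignored)."""
    legacy = Counter(row_fingerprint(r, columns) for r in legacy_rows)
    migrated = Counter(row_fingerprint(r, columns) for r in migrated_rows)
    missing = legacy - migrated   # rows the legacy job produced but the new job did not
    extra = migrated - legacy     # rows only the new job produced
    return {
        "legacy_count": sum(legacy.values()),
        "migrated_count": sum(migrated.values()),
        "missing_in_migrated": sum(missing.values()),
        "unexpected_in_migrated": sum(extra.values()),
        "parity": not missing and not extra,
    }


if __name__ == "__main__":
    # Hypothetical sample data: same rows, different order.
    legacy = [{"user_id": 1, "score": 0.5}, {"user_id": 2, "score": 0.7}]
    migrated = [{"user_id": 2, "score": 0.7}, {"user_id": 1, "score": 0.5}]
    print(parity_report(legacy, migrated, ["user_id", "score"]))
```

Comparing multisets of row hashes rather than sorted files keeps the check independent of partitioning and write order, which typically differ between Hive and Spark outputs. Note that `repr` treats `1` and `1.0` as different values, so column types would need to be normalized first if the two systems disagree on them.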

Requirements

  • Proven experience migrating Hadoop/Hive ETL workloads to a modern Spark environment
  • Strong PySpark and SQL skills on large-scale datasets, with demonstrable performance-tuning experience
  • AWS data services: MWAA, EMR, S3, Glue — hands-on, not theoretical
  • Airflow DAG design and migration at production scale
  • ETL best practices — idempotency, lineage tracking, failure handling
  • Databricks familiarity is a meaningful bonus for the post-migration state

Contract Type

Fixed-price

Level

Mid-level

Budget Range

$14,000 – $24,000

Duration

1 – 3 months

AI Expertise

MLOps & AI Infrastructure
