We're training large transformer models (7B–30B parameters) and our current training setup is inefficient: single-machine multi-GPU only, with no FSDP, no ZeRO optimisation, and no gradient checkpointing.
We need an ML engineer to set up proper distributed training across our cluster (16 × H100 on GKE). Deliverables: a working distributed training config using PyTorch FSDP, training speed benchmarks before and after, and a runbook for our research team.
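To give a sense of the shape of the FSDP deliverable, here is a minimal sketch of how a model might be wrapped for sharded training on our cluster. It assumes a torchrun launch and placeholder names (`TransformerModel`, `TransformerBlock`) standing in for whatever our actual codebase defines; the final config would need to match our models and data pipeline.

```python
# Minimal FSDP wrapping sketch (illustrative only; model/block classes are placeholders).
# Launched per node with torchrun, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 train.py
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from my_model import TransformerModel, TransformerBlock  # hypothetical imports


def main():
    # One process per GPU; NCCL backend for H100s.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = TransformerModel()  # placeholder model constructor

    # Shard parameters at transformer-block granularity.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={TransformerBlock}
    )
    # bf16 for params, gradients, and buffers; a sensible default on H100s.
    mp = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        mixed_precision=mp,
        device_id=local_rank,
        use_orig_params=True,  # keeps original param names for optimizers / torch.compile
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    # ... training loop: forward pass, loss.backward(), optimizer.step() ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The actual deliverable would also cover activation/gradient checkpointing, checkpoint save/load of sharded state, and the GKE job spec; this is just to show the level we're expecting.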
Bonus if you can also set up experiment tracking integration (we use Weights & Biases).
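For the tracking bonus, a sketch of what we have in mind: rank-0-only logging so a multi-process FSDP run produces a single W&B run. The project name here is a placeholder.

```python
# Illustrative Weights & Biases hookup for a distributed run; only rank 0 logs.
import os

import wandb


def is_rank_zero() -> bool:
    return int(os.environ.get("RANK", "0")) == 0


def init_tracking(config: dict) -> None:
    if is_rank_zero():
        wandb.init(project="fsdp-training", config=config)  # project name is hypothetical


def log_step(metrics: dict, step: int) -> None:
    if is_rank_zero():
        wandb.log(metrics, step=step)
```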