There's something philosophically interesting about reinforcement learning that doesn't show up in supervised learning: the agent is responsible for generating its own experience. It decides what to try, what to ignore, and, in a sense, what to learn. We find that interesting. We also find it practically useful for what we build: adaptive tutoring systems that adjust lesson sequencing and difficulty in real time based on how individual students are responding, without needing a human curriculum designer in the loop for every edge case.

The RL work here is applied and grounded: not robotics or game environments, but real student interaction data, real learning outcomes, and real constraints around data sparsity, safety, and interpretability. We're looking for an engineer who has implemented RL algorithms from the literature, someone who hasn't just run a tutorial PPO implementation against a gym environment, but has actually reasoned about why an algorithm is or isn't working and done something about it. Our core stack is Python and PyTorch; we use Stable-Baselines3 for baselines and build custom environments in most cases.
Responsibilities
Design and implement RL-based lesson sequencing policies for the adaptive tutoring engine
Build and maintain custom simulation environments modelling student learning trajectories
Run controlled experiments comparing RL policies against curriculum baselines
Analyse policy behaviour and failure modes and document findings clearly
Contribute to our offline evaluation framework for safe policy validation before deployment
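To give a concrete flavour of the offline-evaluation work above: the sketch below is a minimal ordinary importance-sampling estimator for off-policy value, a standard textbook technique for validating a policy from logged data before deployment. The function name, data layout, and toy numbers are all illustrative assumptions, not our actual framework.

```python
import numpy as np

def importance_sampling_value(trajectories):
    """Ordinary (trajectory-level) importance sampling estimate of policy value.

    Each trajectory is a list of (p_target, p_behavior, reward) tuples:
    the probability the candidate policy assigns to the logged action,
    the probability the logging (behaviour) policy assigned to it, and
    the observed reward. (Toy data layout, for illustration only.)
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0   # product of per-step likelihood ratios
        ret = 0.0      # undiscounted return of the logged trajectory
        for p_target, p_behavior, reward in traj:
            weight *= p_target / p_behavior
            ret += reward
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Sanity check: when target and behaviour policies agree, every weight
# is 1 and the estimate reduces to the mean logged return.
logged = [
    [(0.5, 0.5, 1.0), (0.5, 0.5, 1.0)],  # return 2.0
    [(0.5, 0.5, 0.0)],                   # return 0.0
]
value = importance_sampling_value(logged)
```

Ordinary IS is unbiased but high-variance; in practice one would reach for per-decision or weighted variants, which is exactly the kind of trade-off this role involves reasoning about.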
Requirements
3–5 years of software engineering experience, with at least 2 years of focused RL work
PyTorch for implementing and debugging custom RL algorithms
Deep understanding of policy gradient and actor-critic methods (PPO, A3C, SAC), not just at the API level
Experience designing and implementing custom RL environments
NumPy for efficient numerical computation in training loops
Familiarity with Stable-Baselines3 or similar frameworks for rapid prototyping and baselines
Exposure to offline RL or imitation learning is a genuine plus, given our data constraints
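For a sense of what "designing and implementing custom RL environments" means here, the sketch below is a minimal toy environment following the standard Gymnasium-style reset/step interface, exercised with a greedy baseline policy. The class name, state representation, and mastery dynamics are invented for illustration; they are not our student model.

```python
import numpy as np

class ToyStudentEnv:
    """Toy environment with a Gymnasium-style reset/step API.

    State: per-skill mastery levels in [0, 1].
    Action: index of the skill to teach next.
    Reward: the resulting gain in mastery for that skill.
    (All dynamics here are made up for illustration.)
    """

    def __init__(self, n_skills=5, learning_rate=0.3, horizon=20, seed=0):
        self.n_skills = n_skills
        self.learning_rate = learning_rate
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.mastery = self.rng.uniform(0.0, 0.2, size=self.n_skills)
        return self.mastery.copy(), {}

    def step(self, action):
        before = self.mastery[action]
        # Teaching a skill closes a fixed fraction of its remaining mastery gap.
        self.mastery[action] += self.learning_rate * (1.0 - before)
        reward = self.mastery[action] - before
        self.t += 1
        terminated = bool(self.mastery.min() > 0.95)  # all skills mastered
        truncated = self.t >= self.horizon             # episode length cap
        return self.mastery.copy(), reward, terminated, truncated, {}

# Roll out a greedy "teach the weakest skill" baseline as a sanity check.
env = ToyStudentEnv()
obs, _ = env.reset()
total = 0.0
done = False
while not done:
    action = int(np.argmin(obs))
    obs, reward, terminated, truncated, _ = env.step(action)
    total += reward
    done = terminated or truncated
```

An environment shaped like this drops straight into Stable-Baselines3 once wrapped as a `gymnasium.Env` with declared observation and action spaces, which is how we keep baselines cheap to run.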
Benefits
RL applied to something that genuinely improves how people learn
Full remote across US and EU time zones
$105,000 – $130,000 base salary + equity
Two hours per week of research reading time, formally built into the schedule
Small team where your experimental results directly inform product direction