About the Role

This role applies reinforcement learning to large-scale production systems that improve web data extraction, structured output generation, and intelligent agent workflows. It combines reinforcement learning, large language model systems, training infrastructure, experimentation, and production deployment.

This is a highly hands-on engineering and research position. You will own training systems, reward pipelines, evaluation frameworks, and production-ready model infrastructure, building each from the ground up.

The role is designed for someone who moves quickly, runs rapid experiments, and bridges classical reinforcement learning approaches with modern LLM agent systems.

What You’ll Do

Build training infrastructure and reward pipelines from scratch for model training and evaluation
Own the complete model lifecycle including data collection, reward modeling, training runs, evaluation, and deployment
Design and maintain custom infrastructure required for reinforcement learning workflows
Fine-tune foundation models for advanced web data extraction and structured content generation
Improve model quality through rigorous experimentation and optimization techniques
Apply reinforcement learning methods to multi-step LLM agent workflows
Design reward signals and policy optimization strategies for agent behavior improvement
Identify where classical reinforcement learning approaches outperform prompting methods and vice versa
Run fast, iterative experiments based on meaningful hypotheses and measurable outcomes
Interpret experiment results quickly and make rapid technical decisions
Communicate reinforcement learning concepts clearly to engineers, product teams, and leadership stakeholders
Explain reward functions, model behavior, and optimization strategies in practical business terms
Collaborate closely with research engineers and engineering teams on search, ranking, and product improvements
Connect reinforcement learning improvements directly to production product systems
Deploy and maintain models serving real production traffic
Balance tradeoffs between model quality, latency, scalability, and infrastructure cost

Qualifications

3+ years of experience in applied reinforcement learning, machine learning engineering, or production model training
Strong experience building custom training infrastructure and reward pipelines independently
Experience designing and operating training loops, reward models, data pipelines, and evaluation frameworks
Hands-on experience managing GPU clusters and large-scale training runs
Ability to debug convergence issues and production model behavior
Proven experience fine-tuning models to achieve high-performance results
Deep understanding of data curation, training dynamics, hyperparameter optimization, and evaluation methodology
Strong knowledge of PPO, RLHF, reward modeling, policy optimization, and reinforcement learning systems
Experience working with modern large language model agents and agent workflows
Ability to integrate reinforcement learning techniques with LLM-based systems
Production mindset with experience deploying models into real-world applications
Ability to make practical tradeoffs between quality, latency, and operational cost
Strong experimentation skills with rapid iteration cycles
Excellent written and verbal communication skills
Ability to explain complex technical findings clearly to non-specialists
Experience collaborating in fast-moving engineering and research environments

Preferred Backgrounds

Experience as a reinforcement learning engineer at AI labs or applied machine learning teams
Background in RLHF or reward modeling for language model systems
Experience building machine learning training infrastructure at startups
Experience combining reinforcement learning with language model systems
Background in applied research environments with strong production focus
Experience working on intelligent agent systems

What This Role Is Not Looking For

Purely theoretical researchers without production experience
Candidates who rely on dedicated platform teams for infrastructure setup
Professionals with experience in only reinforcement learning or only large language models, with no overlap between the two domains
Candidates accustomed to slow experimentation cycles with lengthy iteration timelines
Communicators who rely heavily on technical jargon at the expense of practical clarity

Work Environment & Pace

This role operates in a fast-paced environment with a strong emphasis on rapid iteration, technical ownership, experimentation speed, and production impact.

Compensation: $180,000 - $290,000 per year

Benefits & Perks

Competitive compensation based on impact and contribution
Equity participation up to 0.15%
Generous paid time off including 15 mandatory PTO days
Flexible additional PTO approval process
12 weeks of fully paid parental leave
$100 monthly wellness stipend for health and wellness expenses
Up to $1,000 annually for learning and professional development
Company-sponsored team offsites
Three-month paid sabbatical after four years of employment

Benefits for US-Based Employees

Medical, dental, and vision coverage with 100% employee coverage and 50% dependent coverage
Employer-paid life insurance and disability coverage
Optional accident, critical illness, hospital indemnity, and voluntary life insurance plans
Telehealth access through Doctegrity
401(k) retirement plan
Pre-tax FSAs and commuter benefits
Pet insurance coverage

Additional Perks for San Francisco Employees

Office snacks, beverages, and team lunches
Collaborative startup office environment
Access to a loaner electric bike for city transportation