ML Systems Engineer — Training & Inference Optimization (MBMB)

We are building large-scale embodied intelligence systems designed to operate in complex real-world environments. Our work spans robot foundation models, high-performance training infrastructure, and on-device inference systems that run directly on robotic hardware.
We are seeking ML Systems Engineers to optimize both our training and on-robot inference stacks. The role focuses on pushing performance across hardware, software, and model design, an area where gains are still step-function rather than incremental.
Internally, this team is known as MBMB (More Big More Better).

What You’ll Do
Push Training and Inference Performance to the Limit
Optimize both large-scale training systems and on-robot inference stacks
Deliver meaningful, step-function improvements in throughput, latency, and efficiency
Improve end-to-end system performance across distributed training and deployment environments

Make GPUs Perform at Maximum Efficiency
Identify and remove bottlenecks across the full compute stack
Optimize GPU utilization across training and inference workloads
Improve performance of transformer and diffusion-based architectures under real-world constraints

Engineer Across the Full Stack
Implement model-level, hardware-aware, and software-level optimizations that materially improve system performance
Work across:
CUDA kernels and low-level GPU execution
ML model architecture and compute efficiency
CPU bottlenecks and data pipelines
Network and distributed systems performance (NVLink, other interconnects, and cluster communication)
Python, NumPy, and PyTorch-level inefficiencies

Drive System-Level Improvements
Evaluate and implement changes that lead to measurable gains in training and inference efficiency
Collaborate with ML researchers and systems engineers to identify high-leverage optimization opportunities
Continuously profile, benchmark, and improve system performance across evolving workloads

What We’re Looking For
Strong experience with performance optimization in ML systems
Up-to-date knowledge of modern training and inference techniques for transformer and diffusion models
Ability to reason across the full stack, including:
GPU and CUDA-level optimization
Model architecture efficiency
CPU, memory, and I/O bottlenecks
Distributed networking and communication overhead
Framework-level performance (PyTorch, NumPy, Python)
Strong systems intuition and ability to identify bottlenecks quickly
Comfort operating in fast-moving environments where large performance gains are still available

Preferred Experience
Experience optimizing large-scale training or inference systems
Deep familiarity with GPU programming and kernel optimization
Experience working with distributed ML systems at scale
Exposure to model architecture-level efficiency improvements
Background spanning both systems engineering and machine learning

Why This Role Matters
Direct impact on both training speed and real-time robot performance
Work on problems where improvements are still large and measurable
Shape the efficiency and scalability of next-generation embodied intelligence systems
Operate across the full stack — from hardware execution to model design

About the Company
We are a research-driven AI and robotics company focused on building scalable embodied intelligence systems. By combining advances in machine learning, systems engineering, and robotics, we aim to push the frontier of efficient, real-world AI.

We are committed to building an inclusive and diverse workplace and encourage applicants from all backgrounds to apply.
