Staff / Principal ML Training Systems Engineer

We are building next-generation intelligent systems capable of operating in complex, real-world environments. Our team develops the full stack — from high-performance hardware and distributed systems infrastructure to large-scale multimodal foundation models powering autonomous decision-making.

Backed by significant funding and operating at the intersection of AI, systems engineering, and large-scale compute infrastructure, we are investing heavily in research, infrastructure, and scalable training systems to push the frontier of embodied intelligence.

We are seeking a Staff / Principal ML Training Systems Engineer to lead training systems performance across large-scale multimodal AI workloads. This is a core systems engineering role focused on scalability, efficiency, and correctness at massive GPU scale. Your work will directly impact infrastructure utilization, training throughput, and research iteration speed.

What You’ll Do
Own Training Performance End-to-End
Diagnose and optimize performance for large-scale multimodal training workloads involving vision, video, language, sensor data, and sequential decision-making
Build systematic performance attribution tooling, including:
Step-time decomposition
Compute vs communication analysis
Input pipeline profiling
Scaling curve analysis across cluster sizes
Bottleneck identification and prioritization

Drive Efficiency Improvements Across the Stack
Improve distributed training efficiency through:
Communication/computation overlap
Gradient bucketing
Topology-aware workload placement
Parallelism optimization strategies

Improve compute efficiency through:
Kernel optimization
Operator fusion
Attention optimization
Runtime and framework overhead reduction

Improve memory efficiency through:
Activation checkpointing
Sequence packing and bucketing
Memory fragmentation reduction

Design and Evolve Training Systems
Define and optimize data, tensor, pipeline, sharded, and hybrid parallelism strategies
Improve execution efficiency through:
Communication scheduling and overlap
Graph capture and execution optimization
Runtime-level improvements
Extend and improve internal training frameworks where necessary

Make Performance Observable and Measurable
Establish source-of-truth performance metrics including:
Step-time breakdowns
Model FLOPs utilization (MFU)
Throughput and scaling efficiency
Build tooling to:
Detect bottlenecks quickly
Compare scaling behavior across model families and cluster configurations
Track performance regressions over time
Develop automated benchmarking and regression detection systems

Partner Closely With Research Teams
Collaborate directly with research scientists and ML engineers in a highly integrated environment
Translate novel model architectures and research ideas into scalable, production-ready implementations
Advise on training tradeoffs involving:
Long-horizon sequence modeling
Multimodal and variable-length data
Evaluation cadence and rollout efficiency

Improve Cluster-Level Efficiency
Work with infrastructure and reliability teams to optimize utilization across large distributed workloads
Analyze the impact of networking, collectives, and cluster topology on training efficiency
Improve topology-aware scheduling and large-scale scaling behavior

What We’re Looking For
Proven track record optimizing large-scale distributed ML training systems
Deep hands-on experience with modern ML frameworks (PyTorch required; JAX is a plus)
Strong understanding of:
Data, tensor, and pipeline parallelism
FSDP / ZeRO-style sharded training
Communication overlap strategies
Large-scale GPU cluster scaling behavior
Strong systems intuition across compute, communication, and memory bottlenecks
Exceptional debugging and performance analysis skills
High-ownership mindset and comfort operating in fast-moving, highly technical environments

Preferred Experience
GPU kernel or compiler-level optimization experience (CUDA, Triton, graph capture, operator fusion)
Experience with multimodal or video training involving variable-length sequences and packing strategies
Experience building or extending distributed training frameworks and runtimes
Familiarity with cluster networking, topology-aware scheduling, and large-scale infrastructure effects

Why This Role Matters
Direct impact on research velocity — every efficiency improvement accelerates model development across the organization
Opportunity to shape the scalability and performance of next-generation multimodal training systems
High-leverage engineering work with compounding impact across all training workloads
Small, highly technical team with significant ownership and autonomy

About the Company

We are a research-driven AI company focused on building scalable intelligent systems capable of robust operation in dynamic environments. By combining advances in machine learning, distributed systems, and infrastructure engineering, we aim to push the frontier of large-scale AI systems.

We are committed to building an inclusive and diverse workplace and encourage applicants from all backgrounds to apply.