Systems Engineer — Distributed Training & GPU Infrastructure

We are building advanced large-scale AI systems that push beyond conventional training and inference paradigms. Our work focuses on long-horizon model training, novel architectures, and distributed systems that must adapt quickly as model designs evolve.
We are seeking a Systems Engineer to architect and operate the infrastructure powering large-scale training and inference workloads across GPU clusters. This role focuses on reliability, performance, and flexibility in environments where standard assumptions about model training no longer hold.

You will work on the core systems that enable efficient, fault-tolerant execution of long-running distributed jobs at scale.

What You’ll Do
Design Distributed Training Systems
Architect and implement distributed training strategies for novel model architectures
Optimize parallelism approaches across GPU clusters (data, tensor, pipeline, and hybrid methods)
Adapt system design to evolving model requirements and training semantics

Improve Reliability and Fault Tolerance
Build systems for health monitoring, failure detection, and automated recovery
Design and implement fast, asynchronous checkpointing mechanisms for long-running training jobs
Ensure high availability and resilience for large-scale distributed workloads

Optimize GPU Cluster Performance
Improve efficiency of large-scale GPU training and inference systems
Profile and resolve system bottlenecks across compute, communication, and storage layers
Use GPU profiling tools to identify and fix performance issues

Build and Maintain Core Infrastructure
Work with cluster orchestration systems such as Slurm and Kubernetes
Develop systems-level tooling in C++ and Python for distributed training environments
Integrate and optimize PyTorch Distributed workloads, including FSDP-based training setups

What We’re Looking For
3+ years of experience in HPC, cloud infrastructure, or distributed machine learning systems
Deep expertise in PyTorch Distributed, including FSDP and collective communication primitives
Strong systems programming skills in C++ and Python
Experience with cluster orchestration systems such as Slurm or Kubernetes
Familiarity with GPU profiling and performance analysis tools such as NVIDIA Nsight Systems and PyTorch Profiler

Preferred Experience
Experience training large-scale language models (10B+ parameters)
Exposure to experimental model architectures such as encoder-decoder models and chunked attention mechanisms
Experience with FP16/FP8 training, mixed precision, or quantization techniques
Contributions to open-source distributed training frameworks (e.g., PyTorch, Megatron-LM)
Experience working in environments with rapidly evolving model architectures

Why This Role Matters
Build the infrastructure layer for next-generation AI systems with non-standard training dynamics
Enable reliable and efficient execution of long-horizon distributed training workloads
Work at the intersection of systems engineering, ML infrastructure, and large-scale AI research
Have a direct impact on the performance and scalability of frontier model development

About the Company
We are a research-driven AI organization focused on building scalable, adaptive machine learning systems. By combining advances in distributed systems, GPU infrastructure, and model architecture research, we aim to push the boundaries of what large-scale AI systems can achieve.

We are committed to building an inclusive and diverse workplace and encourage applicants from all backgrounds to apply.
