Validation Engineer
Accepting applications
Arena Physica · New York City Metropolitan Area
Full-Time · Entry · AI · PCIe · Python
Posted: 14 May
Category: Test
Experience: Entry
Country: N/A
Who we are
Arena Physica is on a mission to accelerate hardware innovation that powers human progress. Our name is inspired by Theodore Roosevelt's 'Citizenship in a Republic' speech. To us, entering the Arena means committing fully and accepting the risk of failure in pursuit of an audacious, worthy cause. We believe the future belongs to those brave enough to build it.
Our team of 50 combines AI engineering and applied physics expertise with deep experience in enterprise deployments. We're headquartered in NYC with presences in San Francisco and Los Angeles, backed by ~$90M from Initialized, Founders Fund, Goldcrest Capital, Fifth Down Capital, and Shield Capital.
If you're ready to do the most important work of your career, join us in the Arena.
What we do
At Arena Physica, we're building electromagnetic superintelligence. Our AI platform Atlas operationalizes physics-grounded intelligence to verify, debug, and optimize hardware across its lifecycle. Atlas is already trusted globally by the world's most advanced hardware companies, including AMD, Anduril, and Bausch & Lomb, for applications across R&D, integration testing, production assembly, and field repair.
About the role
As a Validation Engineer, you will be the domain expert who makes Atlas indispensable for our customers' datacenter and cluster validation workflows. You'll work at the intersection of deep technical expertise in system validation and cluster testing, customer engagement, and product development — using Atlas to solve real problems for hardware validation teams at leading companies while translating those workflows and insights back to our engineering and product teams.
Most validation engineers work inside discrete platforms, executing program by program. Here, your expertise will be the training signal that compounds Atlas’s intelligence for every customer. You'll own outcomes across the technical, product, and customer dimensions, leveraging your deep domain knowledge to scale beyond your direct work.
How you will contribute
Be the validation & performance expert - Execute datacenter validation and cluster performance testing across GPU/CPU/memory/BIOS/BMC/networking/storage subsystems; benchmark, profile, and optimize system and cluster performance; debug complex hardware/firmware/software interactions and drive root-cause analysis.
Deploy Atlas with customers - Embed at customer sites to validate datacenter hardware using Atlas as your primary tool, augmenting with your own expertise where needed. Build credibility through technical depth and results.
Codify and scale - Your value here isn’t just what you fix in the field; it’s what you teach Atlas. Establish validation methodologies for Atlas across common subsystems and testing phases (EVT, DVT, PVT). Alongside this, translate customer workflows and pain points into product requirements, and work closely with our engineering team to encode that expertise into Atlas. Every deployment should compound Atlas’s value more broadly.
You have
Elite datacenter validation expertise - 4+ years with AI/ML datacenter infrastructure, GPU cluster validation, or large-scale hardware validation at leading hardware companies or cloud providers; you're the person that hardware teams call to debug complex system issues.
Full-stack hardware debugging mastery - Deep understanding of GPU/CPU architecture, memory subsystems, BIOS/UEFI/BMC firmware, high-speed interconnects (PCIe/CXL/InfiniBand/RoCE), NVMe storage, and power/thermal management; experience validating systems from deployment through production at node and cluster scale; proven track record debugging issues across hardware, firmware, drivers, and software in distributed ML infrastructure.
Performance optimization at scale - Strong experience benchmarking and tuning GPU clusters at multiple scales (cluster/rack/node); expertise with profiling tools, GPU utilization patterns, memory bandwidth bottlenecks, interconnect performance, and distributed training efficiency.
Customer-facing technical leadership - You earn trust through technical credibility, understand workflows and pain points, communicate complex concepts clearly, and build strong relationships.
Automation & software engineering skills - Proficiency in Python, Bash, or similar for building validation frameworks and automating tests at scale; comfortable with APIs, CI/CD environments, and collaborating with software engineers to productize workflows.
Platform expertise - Experience with AMD and/or NVIDIA hardware and software stacks: AMD EPYC CPUs, Instinct GPUs, the ROCm software stack, or AMD networking technologies, and/or NVIDIA Grace CPUs, H100/B200/GB200 GPUs, the CUDA/cuDNN/NCCL/TensorRT software stack, and InfiniBand/NVLink networking technologies.
Travel domestically and internationally (30-40% of your time)
Work in person at Arena Physica's NYC HQ when not traveling
Benefits & Perks Include:
100% of monthly premiums covered for Aetna medical, vision, and dental insurance for you and your dependents
401(k) Retirement Plan
Unlimited PTO
Lunch every day from local restaurants via Sharebite
Relocation support provided