CPU & Microarchitecture Performance Engineer
Accepting applications
Aramas AI · San Francisco Bay Area
Full-Time · Mid · ARM · C++ · PCIe
Posted
1d ago
Category
Design
Experience
Mid
Country
United States
The mission
Every claim we make about latency, throughput, and efficiency has to be earned — measured, modeled, proven. We do not ship projections. This role is the instrument that generates the numbers that change what we build before it is too late to change it.
What you own
Low-level cycle modeling and performance analysis of our accelerator's execution engine. Branch behavior, cache pressure, memory access choreography, queue depth — translated into architectural feedback that drives tape-out decisions.
What you'll do
Build cycle-accurate or statistical performance models of agent workloads running on our hardware
Profile on real silicon using hardware performance counters and pre-silicon emulation environments
Identify bottlenecks across the data path and produce latency budgets that drive upstream architecture decisions
Own the performance regression suite as the team and hardware scale
Produce analysis rigorous enough to change a tape-out decision
What we're looking for
Deep fluency in x86 or ARM microarchitecture — knows what a pipeline stall costs in wall-clock terms, not just in theory
Has written or used architectural simulators (gem5, Sniper, ZSim, or equivalent) to model real systems
Reads assembly to diagnose problems, not just to write code
Background in CPU performance engineering at Intel, AMD, Qualcomm, Apple, Ampere, or equivalent
Bonus: experience with PCIe device driver performance or accelerator bring-up
Signal keywords
PMU counters · gem5 / Sniper / ZSim · Microarchitecture · Cache modeling · Pipeline analysis · C / C++