Member of Technical Staff - Architect
Accepting applications · Architect Labs · Palo Alto, CA
Full-Time · Mid-senior · AI, ASIC, C++, FPGA, Python
Posted
1d ago
Category
Design
Experience
Mid-senior
Country
United States
About Architect
Architect is a frontier AI lab for chip design. We build AI models and tools for on-demand custom ASICs at scale. Our goal is to co-design custom ASICs alongside evolving ML workloads and enable a new era of domain-specific chips that unlock capabilities impossible with current hardware paradigms. Born out of Stanford research, we blend AI with silicon, with a founding team from Anthropic, Google DeepMind, Meta SuperIntelligence, xAI, Apple, and Intel.
What You'll Do
As a Founding Member of the Technical Staff on the Architecture team at Architect, you'll own the microarchitecture definition of our specialized, high-performance XPUs (such as AI/ML or NPU cores), discovered and explored using our in-house AI system. You'll drive the HW-SW co-design in collaboration with the compiler and systems teams, and carry the architecture from spec through RTL handoff to silicon bring-up.
Define and own the microarchitecture of the AI cores targeting best-in-class performance-per-watt.
Review and monitor the translation of AI-explored microarchitectural specifications into production SystemVerilog RTL for the AI core blocks: PE/MAC arrays, vector engines, scratchpad memory controllers, SRAM banks and arbiters, DMA engines, and datapath logic.
Build and maintain cycle-accurate architectural models (C++/SystemC) to evaluate PPA trade-offs across compute density, memory bandwidth, numeric precision, and power before RTL commitment.
Drive HW/SW co-design with Compiler and Systems teams by defining ISA-level abstractions, instruction scheduling constraints, and data movement patterns that map efficiently to target ML workloads (convolution, attention, elementwise).
Deliver complete microarchitectural specifications with interface definitions, modular cut-lines, and architectural validation models for handoff to RTL and DV teams.
Work with DV and DMA teams to define interface contracts (AXI/AXI-Stream), architectural checkpoints, and validation-ready reference models.
What We'd Like to See
Qualifications & Skills:
Degree: Bachelor's, Master's, or PhD in Electrical Engineering, Computer Engineering, or a closely related field.
Tapeout Experience: 5+ years (10+ preferred) in advanced-node tapeouts at top chip companies or fast-moving silicon startups.
Domain Background: Deep expertise in NPU and ML accelerator architecture, ideally with experience on Apple Neural Engine, Qualcomm Hexagon NPU, Google TPU, AMD XDNA, Samsung NPU, MediaTek APU, or accelerators at Groq, d-Matrix, Cerebras, MatX, or similar.
SystemVerilog: Clear, synthesizable, lint-clean RTL with strong design habits such as parameterization, modularity, reuse, and configurability.
Compute Datapaths: Hands-on experience designing systolic arrays, MAC units, vector/SIMD engines, or VLIW execution pipelines.
Memory Hierarchy: Experience with on-chip SRAM banking, scratchpad management, data reuse strategies, and bandwidth balancing against compute throughput.
Modeling: Strong architectural modeling skills in C++, SystemC, or equivalent. Proficiency in SystemVerilog and Python.
End-to-End Ownership: Proven track record taking an architecture from specification through RTL handoff to silicon bring-up and validation.
Bonus:
ISA design or programmable accelerator architecture experience.
Understanding of model quantization, mixed-precision inference, and numeric format trade-offs (INT, FP/BF, MX, etc.).
Advanced power optimization techniques for edge: clock gating, power gating, voltage scaling.
NoC design, on-chip interconnect fabrics, or AMBA protocols (AXI, AHB, APB).
Hands-on FPGA prototyping for architecture validation (ideally Xilinx).
Domain-specific expertise: Track record in research and development on SOTA XPU architectures, including but not limited to CIM/PIM, mixed-precision and reconfigurable MAC arithmetic, reduced data movement and lossless compression, and targeted PPA optimizations for the highest performance per watt.
What We Offer
Competitive salary and meaningful equity stake
Fast-paced startup with autonomy and visible impact
Cutting-edge challenges at the intersection of AI and silicon design