Research Engineer, AI Agents
Accepting applications
RecruitSeq · San Francisco, CA
Full-Time · Mid-senior · AI · Python
Posted: 3d ago
Category: Manufacturing
Experience: Mid-senior
Country: United States
San Francisco, CA (On-Site M-F)
Our client is a well-funded, early-stage AI infrastructure startup in the San Francisco Bay Area building core tooling for monitoring and improving production AI agents used in high-stakes workflows. The team focuses on analyzing agent behavior at scale so customers can detect failures, improve reliability, and continuously ship better agents to production.
About the Role
You will build AI systems that turn large-scale agent interaction data into concrete evaluation signals and self-improving feedback loops. This is a highly technical, end-to-end role: you will design retrieval and search mechanisms, develop custom evaluation harnesses, and work directly with messy real-world production data to understand and improve long-running agent behaviors. You will work closely with a small onsite team in the San Francisco area, owning projects from initial exploration through deployment in customer-facing environments.
Responsibilities
Design and implement retrieval and search systems so agent workflows can access the right context quickly and make correct decisions in production.
Build custom evaluation harnesses and tooling to inspect, manipulate, and understand complex agent trajectories beyond what off-the-shelf frameworks provide.
Develop sandboxed, always-on environments where evaluation agents can run autonomously to monitor, score, and improve production agents over time.
Aggregate, clean, and analyze large-scale agent interaction data to extract meaningful evaluation signals, define benchmarks, and drive behavior improvements.
Build and maintain internal infrastructure, pipelines, and tools for rapid experimentation, post-training, and optimization on top of production agent data.
Collaborate with the core engineering and product team to translate customer reliability issues into concrete monitoring, alerting, and agent-optimization solutions.
Ship changes end-to-end in an onsite, fast-paced environment, including hands-on debugging of long-running agent behaviors in live systems.
Qualifications
1–4 years of industry experience in applied AI or generative AI, with hands-on work building, evaluating, or operating agents in production environments.
Strong software engineering skills in Python and modern ML/AI tooling, with the ability to own systems end-to-end from data to deployment.
Experience working with large, noisy real-world datasets, including building pipelines for data curation, feature extraction, and evaluation.
Familiarity with LLM-based systems, retrieval-augmented generation, and common approaches to agent evaluation (e.g., LLM-as-a-judge, rule-based metrics, or tracing-based analysis).
Comfort operating in ambiguous, zero-to-one environments with high ownership and a strong bias toward shipping.
Ability to work on-site in the San Francisco Bay Area 5–6 days per week.
Preferred Skills
Prior experience at a top-tier tech company or high-growth startup working on agents, LLM infrastructure, or evaluation/observability tooling.
Background in building retrieval systems, search infrastructure, or evaluation pipelines for LLM or agent-based products.
Exposure to post-training and optimization techniques such as SFT, RL, or preference-based methods applied to agent behavior.
Experience analyzing long-running, tool-using, or multi-step reasoning agents in production settings.
Prior work in small, high-intensity teams with a strong in-person culture in the San Francisco ecosystem.