Product Reliability Engineer

Accepting applications

Jobgether · United States

Full-Time · Mid-senior · AI · Python
Posted: 1d ago
Category: Test
Experience: Mid-senior
Country: United States
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Product Reliability Engineer based in the United States.

This role sits at the critical intersection of software engineering, customer reliability, and production operations for infrastructure software deployed in complex, real-world environments. You will ensure that production systems running in customer-owned Kubernetes environments remain stable, observable, and continuously improvable. The work goes beyond incident response, focusing on eliminating entire categories of failures through better tooling, automation, and product design. You will partner closely with customers, engineers, and solution teams to investigate complex issues, drive root-cause analysis, and translate findings into long-term system improvements. This is a highly hands-on role where debugging, automation, and product thinking come together to define reliability as a core product capability. Your work will directly shape how enterprise customers experience stability, performance, and trust in the platform.

Accountabilities

Partner with customers and internal teams to investigate and resolve complex production issues across Kubernetes-based on-prem and hybrid deployments.
Lead deep root-cause analysis for escalations, reproduce issues, and collaborate with engineering teams to implement durable fixes.
Build and maintain reliability tooling such as diagnostics systems, health checks, support bundles, and environment validation utilities.
Own and improve test automation frameworks, focusing on CI stability, reducing flaky tests, and strengthening integration and end-to-end coverage.
Define and maintain performance baselines, regression testing frameworks, and reliability gates to prevent production regressions.
Improve installation, upgrade, and deployment reliability by identifying recurring failure patterns and building preventive solutions.
Develop production-grade internal tools and product enhancements using Python, Go, or Rust to strengthen observability and system resilience.
Establish a closed feedback loop from customer issues to engineering improvements in testing, observability, documentation, and defaults.

Requirements

4-7 years of experience in production engineering, SRE, platform engineering, or similar roles focused on reliability and distributed systems.
Strong software engineering fundamentals, including debugging, testing, system design, and production-grade coding practices.
Hands-on Kubernetes expertise, including troubleshooting workloads, networking, storage, RBAC, and multi-environment deployments.
Strong experience with observability tools and techniques, including logs, metrics, and tracing for distributed system debugging.
Proficiency in at least one programming language such as Python, Go, or Rust, with experience building internal tools or production systems.
Strong analytical and communication skills, with the ability to break down complex incidents into clear root causes and actionable recommendations.
Experience working in cross-functional environments with engineering, product, and customer-facing teams in fast-moving contexts.
Self-directed and comfortable working in remote-first environments with shifting priorities driven by customer needs and escalations.

Benefits

Competitive compensation package aligned with experience and seniority
Fully remote work environment across Canada and the United States
Opportunity to work on real-world production infrastructure used in complex enterprise environments
Strong technical ownership with high impact on product reliability and customer experience
Collaboration with experienced engineers in infrastructure, automation, and platform engineering
Learning and growth opportunities in Kubernetes, observability, and large-scale distributed systems
Inclusive and diverse team culture focused on collaboration and continuous improvement
Exposure to open-source-driven infrastructure innovation

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
