Software Engineer, Reliability Careers at OpenAI - San Francisco, California | Apply Now!

OpenAI

Full-time | Posted: Feb 10, 2026

Job Description

Software Engineer, Reliability at OpenAI - San Francisco, CA

Role Overview

Join OpenAI's mission to ensure safe AGI benefits all of humanity as a Software Engineer, Reliability in San Francisco. This senior-level role sits at the heart of our Applied Engineering team, bridging research breakthroughs with production reality. You'll architect the resilient infrastructure powering ChatGPT, DALL-E, and future frontier models serving millions worldwide.

In this high-impact position, you'll champion reliability amidst unprecedented scaling challenges. OpenAI's systems must balance researcher velocity with enterprise-grade stability, handling explosive traffic growth while maintaining sub-second latencies for real-time AI interactions. Your work directly enables safe deployment of transformative technology to consumers and businesses globally.

The Reliability Engineering team operates as a force multiplier, empowering 100+ engineers with self-service tooling that accelerates safe iteration. You'll tackle complex distributed systems problems unique to AI infrastructure - GPU orchestration at exabyte scale, real-time inference latency optimization, and chaos resilience for mission-critical services. This role demands deep systems expertise combined with collaborative humility to succeed in our fast-paced, mission-driven culture.

Success metrics include reducing MTTR by 50%, achieving 99.99% uptime across core services, and enabling 10x capacity growth without proportional headcount increases. You'll partner with world-class researchers, product leaders, and fellow engineers to deliver reliable AI at global scale.

Key Responsibilities

As a Software Engineer, Reliability at OpenAI, your impact spans the full reliability spectrum:

Scalability Architecture: Design infrastructure handling 100x traffic growth, from millions to billions of daily requests across global edge locations.
Testing Platform Ownership: Build load, chaos, and synthetic testing frameworks used by every development team, ensuring proactive reliability before production.
Automation Leadership: Create self-service tools eliminating toil, from auto-scaling GPU clusters to one-click rollback capabilities.
Resource Lifecycle Management: Develop platforms optimizing CPU/GPU/storage utilization across thousands of nodes, driving multi-million dollar efficiency gains.
Fault Tolerance Engineering: Implement resilient patterns surviving correlated failures, network partitions, and hardware faults at planetary scale.
SLO/SLI Framework: Establish measurable reliability objectives guiding engineering decisions across research and product teams.
Cross-Functional Partnership: Collaborate with researchers deploying bleeding-edge models and PMs launching consumer features, ensuring reliability from day zero.
Incident Leadership: Lead on-call response for critical outages, driving post-mortems that prevent recurrence through systemic improvements.
Performance Engineering: Systematically identify and eliminate bottlenecks across the stack, from kernel scheduling to API response times.
Platform Enablement: Build internal developer platforms accelerating safe velocity for 100+ engineering teams.
Capacity Planning: Forecast infrastructure needs supporting exponential AI capability growth over 12-24 month horizons.
Safety Integration: Embed reliability practices ensuring safe AI deployment aligns with OpenAI's core safety commitments.
Tooling Innovation: Pioneer next-generation observability solving AI-specific monitoring challenges like model drift detection.
Mentorship: Guide junior engineers while learning from OpenAI's world-class reliability leadership.

Qualifications

Technical Excellence (Required):

5+ years of production SRE/reliability engineering in hyper-growth environments
Deep cloud expertise (AWS/GCP) managing 1000+ node clusters
Advanced container orchestration (Kubernetes) at massive scale
Strong systems programming (Python/Go/C++) building mission-critical services
Proven IaC mastery (Terraform/CloudFormation) across multi-cloud

AI/ML Infrastructure Bonus:

GPU cluster management experience (NVIDIA/CUDA)
Experience with Ray, Kubernetes operators for ML workloads
Familiarity with vector databases, inference serving frameworks

Cultural Fit (Critical):

End-to-end ownership mindset - runs services soup-to-nuts
Humble collaborator thriving in flat, high-trust teams
Relentless curiosity mastering new domains rapidly
Customer-obsessed, treating internal teams as "external" customers

Salary & Benefits

Competitive Compensation: Total cash compensation $220K-$380K+ (base + bonus), substantial equity, comprehensive benefits package.

Exceptional Benefits:

Top-tier medical/dental/vision with $0 premiums
401(k) with 4%+ match, immediate vesting
Unlimited PTO - take what you need
$3,000+ annual learning stipend
16+ weeks parental leave
Fertility benefits, mental health support
Daily catered meals, gym reimbursement
Latest hardware, relocation support

Why Join OpenAI?

OpenAI isn't just another tech company - we're building safe artificial general intelligence to benefit humanity. Your reliability engineering directly enables this historic mission.

Unmatched Impact: Every system you build reaches hundreds of millions weekly. ChatGPT alone serves 100M+ users/month.

Technical Challenge: Solve problems no one else faces - exabyte-scale AI infrastructure, sub-100ms global inference, chaos resilience at planetary scale.

World-Class Team: Collaborate with PhDs from Stanford/MIT, ex-Google/DeepMind researchers, builders from Stripe/Uber at their peak.

Mission-Driven Culture: Safety > Growth. We move deliberately, prioritizing long-term correctness over short-term metrics.

Explosive Growth: 10x'd team size in 18 months. Unlimited internal mobility across research, infra, product.

How to Apply

Review Fit: Ensure 5+ years reliability experience at scale
Prepare Materials: Resume highlighting production impact metrics
Submit Application: Click "Apply Now" - include GitHub/portfolio links
Interview Process: Recruiter screen → Technical deep-dive → System design → Team match
Timeline: 2-4 weeks end-to-end

OpenAI is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Locations

San Francisco, California, United States

Salary

Estimated Salary Range (high confidence)

231,000 - 418,000 USD / year

Source: AI estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

Site Reliability Engineering (SRE): intermediate
Infrastructure as Code (IaC): intermediate
Cloud Infrastructure (AWS, GCP, Azure): intermediate
Containerization (Docker, Kubernetes): intermediate
Chaos Engineering: intermediate
Load Testing: intermediate
Synthetic Monitoring: intermediate
Service Level Objectives (SLOs): intermediate
Service Level Indicators (SLIs): intermediate
Fault-Tolerant Design: intermediate
Automation Scripting (Python, Go): intermediate
GPU Resource Management: intermediate
Network Lifecycle Management: intermediate
On-Call Incident Response: intermediate
Performance Optimization: intermediate
Scalability Engineering: intermediate
CI/CD Pipelines: intermediate
Monitoring Tools (Prometheus, Grafana): intermediate
Distributed Systems: intermediate
Resource Optimization: intermediate

Required Qualifications

Bachelor's degree in Computer Science, Information Technology, or related field (or equivalent experience)
5+ years of proven experience as a Software Engineer focused on reliability in fast-paced scaling environments
Strong proficiency in cloud infrastructure platforms like AWS, GCP, or Azure
Advanced programming skills in Python, Go, or similar languages
Hands-on experience with containerization technologies (Docker, Kubernetes)
Demonstrated track record building load testing, chaos engineering, and synthetic monitoring systems
Expertise in Infrastructure as Code tools (Terraform, Ansible)
Experience developing and maintaining SLOs/SLIs for production systems
Proven ability to design fault-tolerant and resilient distributed systems
Comfortable with on-call rotations and 24/7 incident response
Strong collaboration skills with cross-functional teams (researchers, PMs, designers)
Experience managing CPU/GPU/storage lifecycle in high-scale environments
Passion for performance optimization and bottleneck identification

Responsibilities

Design scalable infrastructure solutions to handle rapidly growing AI workloads
Build and maintain comprehensive load testing frameworks for development teams
Develop chaos engineering tools to test system resilience under failure conditions
Create synthetic monitoring systems to proactively detect service degradation
Implement automation tools to eliminate repetitive operational tasks
Manage lifecycle of CPU, GPU, storage, and network resources for optimal efficiency
Architect fault-tolerant design patterns minimizing service disruptions
Establish and maintain SLOs/SLIs across all production services
Collaborate with research teams to deploy experimental AI capabilities reliably
Participate in on-call rotation ensuring 24/7 system availability
Partner with product managers to balance rapid iteration with reliability guarantees
Drive performance improvements through systematic bottleneck analysis
Build self-service reliability platforms empowering engineering teams
Optimize resource allocation supporting dynamic AI model training demands

Benefits

Comprehensive medical, dental, and vision insurance coverage
401(k) retirement plan with generous company matching
Unlimited PTO policy with encouraged recharge periods
Annual learning and development stipend ($3,000+)
Comprehensive parental leave (16+ weeks)
Fertility assistance and family planning benefits
Mental health support through dedicated counseling services
Gym membership reimbursement and wellness programs
Commuter benefits for San Francisco employees
Catered meals and fully stocked kitchens daily
Generous equity package with significant growth potential
Relocation assistance for qualifying candidates
Regular team offsites and company-wide retreats
Cutting-edge hardware including latest MacBook Pros
Volunteer time off and charitable donation matching
