
Software Engineer, Reliability Careers at OpenAI - San Francisco, California | Apply Now!

OpenAI

Full-time | Posted: Feb 10, 2026

Job Description

Software Engineer, Reliability at OpenAI - San Francisco, CA

Role Overview

Join OpenAI's mission to ensure safe AGI benefits all of humanity as a Software Engineer, Reliability in San Francisco. This senior-level role sits at the heart of our Applied Engineering team, bridging research breakthroughs with production reality. You'll architect the resilient infrastructure powering ChatGPT, DALL-E, and future frontier models serving millions worldwide.

In this high-impact position, you'll champion reliability amidst unprecedented scaling challenges. OpenAI's systems must balance researcher velocity with enterprise-grade stability, handling explosive traffic growth while maintaining sub-second latencies for real-time AI interactions. Your work directly enables safe deployment of transformative technology to consumers and businesses globally.

The Reliability Engineering team operates as a force multiplier, empowering 100+ engineers with self-service tooling that accelerates safe iteration. You'll tackle complex distributed systems problems unique to AI infrastructure - GPU orchestration at exabyte scale, real-time inference latency optimization, and chaos resilience for mission-critical services. This role demands deep systems expertise combined with collaborative humility to succeed in our fast-paced, mission-driven culture.

Success metrics include reducing MTTR by 50%, achieving 99.99% uptime across core services, and enabling 10x capacity growth without proportional headcount increases. You'll partner with world-class researchers, product leaders, and fellow engineers to deliver reliable AI at global scale.
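
To put the 99.99% uptime target in concrete terms, the short sketch below converts an availability objective into an allowed-downtime budget. It is only an illustration of the arithmetic: the function name and the 30-day and 365-day windows are assumptions made for this example, not details from the posting.

```python
def allowed_downtime_minutes(slo: float, window_days: float) -> float:
    """Return the minutes of downtime an availability SLO permits over a window.

    slo is a fraction (0.9999 for 99.99%); window_days is the length of the
    measurement window in days. Both the name and the windows used below are
    illustrative assumptions.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    minutes_in_window = window_days * 24 * 60
    # The error budget is the fraction of the window allowed to be unavailable.
    return (1.0 - slo) * minutes_in_window


if __name__ == "__main__":
    for label, days in (("30-day window", 30), ("365-day window", 365)):
        budget = allowed_downtime_minutes(0.9999, days)
        print(f"{label}: {budget:.2f} minutes of downtime allowed")
```

Under these assumptions, 99.99% availability leaves roughly 4.3 minutes of downtime in a 30-day window and under an hour across a full year, which is why fast detection and rollback matter as much as preventing failures.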

Key Responsibilities

As a Software Engineer, Reliability at OpenAI, your impact spans the full reliability spectrum:

  • Scalability Architecture: Design infrastructure handling 100x traffic growth, from millions to billions of daily requests across global edge locations.
  • Testing Platform Ownership: Build load, chaos, and synthetic testing frameworks used by every development team, ensuring proactive reliability before production.
  • Automation Leadership: Create self-service tools eliminating toil, from auto-scaling GPU clusters to one-click rollback capabilities.
  • Resource Lifecycle Management: Develop platforms optimizing CPU/GPU/storage utilization across thousands of nodes, driving multi-million dollar efficiency gains.
  • Fault Tolerance Engineering: Implement resilient patterns surviving correlated failures, network partitions, and hardware faults at planetary scale.
  • SLO/SLI Framework: Establish measurable reliability objectives guiding engineering decisions across research and product teams.
  • Cross-Functional Partnership: Collaborate with researchers deploying bleeding-edge models and PMs launching consumer features, ensuring reliability from day zero.
  • Incident Leadership: Lead on-call response for critical outages, driving post-mortems that prevent recurrence through systemic improvements.
  • Performance Engineering: Systematically identify and eliminate bottlenecks across the stack, from kernel scheduling to API response times.
  • Platform Enablement: Build internal developer platforms accelerating safe velocity for 100+ engineering teams.
  • Capacity Planning: Forecast infrastructure needs supporting exponential AI capability growth over 12-24 month horizons.
  • Safety Integration: Embed reliability practices ensuring safe AI deployment aligns with OpenAI's core safety commitments.
  • Tooling Innovation: Pioneer next-generation observability solving AI-specific monitoring challenges like model drift detection.
  • Mentorship: Guide junior engineers while learning from OpenAI's world-class reliability leadership.

Qualifications

Technical Excellence (Required):

  • 5+ years of production SRE/reliability engineering in hyper-growth environments
  • Deep cloud expertise (AWS/GCP) managing 1000+ node clusters
  • Advanced container orchestration (Kubernetes) at massive scale
  • Strong systems programming (Python/Go/C++) building mission-critical services
  • Proven IaC mastery (Terraform/CloudFormation) across multi-cloud

AI/ML Infrastructure Bonus:

  • GPU cluster management experience (NVIDIA/CUDA)
  • Experience with Ray, Kubernetes operators for ML workloads
  • Familiarity with vector databases, inference serving frameworks

Cultural Fit (Critical):

  • End-to-end ownership mindset - runs services soup-to-nuts
  • Humble collaborator thriving in flat, high-trust teams
  • Relentless curiosity mastering new domains rapidly
  • Customer-obsessed, treating internal teams as "external" customers

Salary & Benefits

Competitive Compensation: Total cash compensation $220K-$380K+ (base + bonus), substantial equity, comprehensive benefits package.

Exceptional Benefits:

  • Top-tier medical/dental/vision with $0 premiums
  • 401k with 4%+ match, immediate vesting
  • Unlimited PTO - take what you need
  • $3,000+ annual learning stipend
  • 16+ weeks parental leave
  • Fertility benefits, mental health support
  • Daily catered meals, gym reimbursement
  • Latest hardware, relocation support

Why Join OpenAI?

OpenAI isn't just another tech company - we're building safe artificial general intelligence to benefit humanity. Your reliability engineering directly enables this historic mission.

Unmatched Impact: Every system you build reaches hundreds of millions weekly. ChatGPT alone serves 100M+ users/month.

Technical Challenge: Solve problems no one else faces - exabyte-scale AI infrastructure, sub-100ms global inference, resilient chaos at planetary scale.

World-Class Team: Collaborate with PhDs from Stanford/MIT, ex-Google/DeepMind researchers, builders from Stripe/Uber at their peak.

Mission-Driven Culture: Safety > Growth. We move deliberately, prioritizing long-term correctness over short-term metrics.

Explosive Growth: 10x'd team size in 18 months. Unlimited internal mobility across research, infra, product.

How to Apply

  1. Review Fit: Ensure 5+ years reliability experience at scale
  2. Prepare Materials: Resume highlighting production impact metrics
  3. Submit Application: Click "Apply Now" - include GitHub/portfolio links
  4. Interview Process: Recruiter screen → Technical deep-dive → System design → Team match
  5. Timeline: 2-4 weeks end-to-end

OpenAI is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Locations

  • San Francisco, California, United States

Salary

Estimated Salary Range (high confidence)

231,000 - 418,000 USD / yearly

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Site Reliability Engineering (SRE) - intermediate
  • Infrastructure as Code (IaC) - intermediate
  • Cloud Infrastructure (AWS, GCP, Azure) - intermediate
  • Containerization (Docker, Kubernetes) - intermediate
  • Chaos Engineering - intermediate
  • Load Testing - intermediate
  • Synthetic Monitoring - intermediate
  • Service Level Objectives (SLOs) - intermediate
  • Service Level Indicators (SLIs) - intermediate
  • Fault-Tolerant Design - intermediate
  • Automation Scripting (Python, Go) - intermediate
  • GPU Resource Management - intermediate
  • Network Lifecycle Management - intermediate
  • On-Call Incident Response - intermediate
  • Performance Optimization - intermediate
  • Scalability Engineering - intermediate
  • CI/CD Pipelines - intermediate
  • Monitoring Tools (Prometheus, Grafana) - intermediate
  • Distributed Systems - intermediate
  • Resource Optimization - intermediate

Required Qualifications

  • Bachelor's degree in Computer Science, Information Technology, or related field (or equivalent experience)
  • 5+ years of proven experience as a Software Engineer focused on reliability in fast-paced scaling environments
  • Strong proficiency in cloud infrastructure platforms like AWS, GCP, or Azure
  • Advanced programming skills in Python, Go, or similar languages
  • Hands-on experience with containerization technologies (Docker, Kubernetes)
  • Demonstrated track record building load testing, chaos engineering, and synthetic monitoring systems
  • Expertise in Infrastructure as Code tools (Terraform, Ansible)
  • Experience developing and maintaining SLOs/SLIs for production systems
  • Proven ability to design fault-tolerant and resilient distributed systems
  • Comfortable with on-call rotations and 24/7 incident response
  • Strong collaboration skills with cross-functional teams (researchers, PMs, designers)
  • Experience managing CPU/GPU/storage lifecycle in high-scale environments
  • Passion for performance optimization and bottleneck identification

Responsibilities

  • Design scalable infrastructure solutions to handle rapidly growing AI workloads
  • Build and maintain comprehensive load testing frameworks for development teams
  • Develop chaos engineering tools to test system resilience under failure conditions
  • Create synthetic monitoring systems to proactively detect service degradation
  • Implement automation tools to eliminate repetitive operational tasks
  • Manage lifecycle of CPU, GPU, storage, and network resources for optimal efficiency
  • Architect fault-tolerant design patterns minimizing service disruptions
  • Establish and maintain SLOs/SLIs across all production services
  • Collaborate with research teams to deploy experimental AI capabilities reliably
  • Participate in on-call rotation ensuring 24/7 system availability
  • Partner with product managers to balance rapid iteration with reliability guarantees
  • Drive performance improvements through systematic bottleneck analysis
  • Build self-service reliability platforms empowering engineering teams
  • Optimize resource allocation supporting dynamic AI model training demands

Benefits

  • Comprehensive medical, dental, and vision insurance coverage
  • 401(k) retirement plan with generous company matching
  • Unlimited PTO policy with encouraged recharge periods
  • Annual learning and development stipend ($3,000+)
  • Comprehensive parental leave (16+ weeks)
  • Fertility assistance and family planning benefits
  • Mental health support through dedicated counseling services
  • Gym membership reimbursement and wellness programs
  • Commuter benefits for San Francisco employees
  • Catered meals and fully stocked kitchens daily
  • Generous equity package with significant growth potential
  • Relocation assistance for qualifying candidates
  • Regular team offsites and company-wide retreats
  • Cutting-edge hardware including latest MacBook Pros
  • Volunteer time off and charitable donation matching

