
Software Engineer, Frontier Clusters Infrastructure at OpenAI - San Francisco, California


Full-time | Posted: Feb 10, 2026

Job Description


Join OpenAI's Frontier Systems team and build the world's largest supercomputers powering cutting-edge AI model training. This senior-level role combines distributed systems engineering with hands-on infrastructure operations at massive scale.

Role Overview

The Frontier Clusters Infrastructure Software Engineer role at OpenAI represents a rare opportunity to work at the absolute forefront of AI infrastructure. Our Frontier Systems team doesn't just manage data centers—they transform ambitious designs into operational reality, supporting the hyperscale supercomputers that train OpenAI's most advanced models.

In this position, you'll operate next-generation compute clusters blending distributed systems expertise with practical infrastructure engineering. Expect to scale Kubernetes to unprecedented sizes, automate bare-metal deployments across thousands of nodes, and create software layers that abstract multi-data-center complexity for seamless AI training workloads.

This is hands-on systems engineering where reliability meets velocity. You'll fight fires during critical training runs, build automation that scales with our ambitions, and continuously push the boundaries of what's possible in large-scale computing infrastructure.

OpenAI's mission to develop safe AGI requires infrastructure that doesn't just work—it excels under extreme conditions. Your work directly enables breakthroughs in artificial general intelligence that will shape humanity's future.

Key Responsibilities

As a Software Engineer on Frontier Clusters Infrastructure, your impact will span the full stack from bare metal to application workloads:

  • Kubernetes at Massive Scale: Architect, deploy, and scale Kubernetes clusters serving millions of pods across multiple data centers, implementing advanced provisioning, auto-scaling, and lifecycle automation.
  • Bare-Metal Mastery: Own the complete node lifecycle from rack delivery through firmware flashing, OS installation, and Kubernetes integration, achieving deployment times measured in minutes rather than hours.
  • Software Abstractions: Build distributed systems that present unified interfaces across heterogeneous clusters, enabling training workloads to scale seamlessly regardless of underlying hardware topology.
  • Operational Excellence: Drive metrics like cluster recovery time (target: <15 minutes), firmware upgrade velocity, and hardware utilization rates through relentless automation and observability.
  • End-to-End Reliability: Integrate server BMCs, switch telemetry, power systems, and cooling infrastructure into comprehensive health monitoring platforms that predict failures before they impact training.
  • Observability Engineering: Design monitoring systems capable of handling millions of metrics per second, implementing anomaly detection, capacity forecasting, and automated remediation at hyperscale.
  • Cross-Team Collaboration: Partner with AI researchers, hardware engineers, and networking teams to deliver infrastructure that meets the uncompromising demands of frontier model training.

Success requires balancing careful systems design with the urgency of production training runs—where minutes of downtime cost millions in compute and delay humanity's progress toward safe AGI.

Qualifications

We're seeking senior engineers who thrive in ambiguous, high-stakes environments:

  • 5+ years operating Kubernetes or similar orchestration at significant scale (1000+ nodes)
  • Deep expertise in Linux systems, networking protocols, and GPU computing stacks
  • Strong programming background (Python/Go preferred) with production Infrastructure-as-Code experience
  • Proven ability to build automation that eliminates toil across large engineering teams
  • Experience with bare-metal provisioning, firmware management, or data center operations
  • Track record of improving operational metrics in mission-critical environments

Bonus points for: High-performance computing experience, RDMA networking, cluster federation patterns, or prior work with AI training infrastructure.

Salary & Benefits

Compensation: $250,000 - $450,000 base salary + equity + comprehensive benefits (total compensation varies based on experience and location).

Exceptional Benefits Package:

  • Comprehensive medical, dental, vision coverage (100% premiums covered)
  • Unlimited PTO with wellness days
  • Generous parental leave (16+ weeks)
  • 401(k) with company match
  • Fitness stipend + wellness programs
  • Daily catered meals and fully stocked kitchens
  • Learning stipend for conferences and courses
  • Relocation support for SF move

OpenAI offers equity that could be life-changing as we approach AGI milestones.

Why Join OpenAI?

You'll work with brilliant minds pushing AI boundaries while building infrastructure that scales to meet them. OpenAI offers:

  • Mission Impact: Direct contribution to safe AGI development benefiting humanity
  • Technical Challenge: Largest supercomputers in the world with bleeding-edge hardware
  • Career Growth: Rapid learning environment with world-class peers
  • Culture: Mission-driven teams valuing curiosity, urgency, and impact

San Francisco headquarters provides access to top AI talent and hardware innovation ecosystem.

How to Apply

Ready to build the infrastructure powering AGI? Submit your resume and a brief note about your most impactful infrastructure project. We're moving quickly—top candidates interview within 1 week.

Application Link: [Apply Now]

OpenAI is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Locations

  • San Francisco, California, United States

Salary

Estimated Salary Range (high confidence)

262,500 - 495,000 USD / year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Kubernetes cluster scaling (intermediate)
  • Distributed systems engineering (intermediate)
  • Bare-metal provisioning (intermediate)
  • Infrastructure as Code (IaC) (intermediate)
  • Python programming (intermediate)
  • Go programming (intermediate)
  • Terraform automation (intermediate)
  • Linux systems administration (intermediate)
  • GPU workload management (intermediate)
  • Firmware upgrades (intermediate)
  • Networking integration (intermediate)
  • Observability systems (intermediate)
  • Cluster lifecycle management (intermediate)
  • High-performance computing (HPC) (intermediate)
  • Container orchestration (intermediate)
  • Automation scripting (intermediate)
  • Data center operations (intermediate)
  • Monitoring tools (Prometheus) (intermediate)
  • CI/CD pipelines (intermediate)
  • Hyperscale infrastructure (intermediate)

Required Qualifications

  • 5+ years of experience as an infrastructure, systems, or distributed systems engineer in large-scale environments
  • Deep knowledge of Kubernetes internals, scaling patterns, and containerized workloads
  • Proficiency in cloud infrastructure concepts including compute, networking, storage, and security
  • Strong programming skills in Python, Go, or similar languages
  • Hands-on experience with Infrastructure-as-Code tools like Terraform or CloudFormation
  • Comfortable operating bare-metal Linux environments and GPU hardware
  • Experience with large-scale networking and data center infrastructure
  • Proven track record of building automation to eliminate manual operations
  • Ability to diagnose and resolve issues in fast-moving, mission-critical systems
  • Familiarity with monitoring and observability systems for extreme workloads
  • Bonus: background in GPU workloads, firmware management, or high-performance computing

Responsibilities

  • Spin up and scale massive Kubernetes clusters with automation for provisioning and bootstrapping
  • Develop software abstractions unifying multiple clusters for seamless training workloads
  • Own complete node bring-up process from bare metal through firmware upgrades at scale
  • Optimize operational metrics like reducing cluster restart times from hours to minutes
  • Accelerate firmware and OS upgrade cycles through automation and repeatable processes
  • Integrate networking systems with hardware health monitoring for end-to-end reliability
  • Build comprehensive monitoring and observability systems for early issue detection
  • Maintain hyperscale supercomputers during frontier model training runs
  • Diagnose and resolve production issues during high-pressure training operations
  • Collaborate with research teams to ensure compute infrastructure meets AI training needs
  • Implement cluster lifecycle management including scaling, upgrades, and decommissioning
  • Continuously improve automation to eliminate toil and manual intervention
  • Work cross-functionally with hardware, networking, and software engineering teams

Benefits

  • Competitive salary with equity package in a high-growth AI company
  • Comprehensive medical, dental, and vision insurance (premiums fully covered)
  • 401(k) matching program for retirement savings
  • Unlimited PTO with encouraged recharge periods
  • Generous parental leave policies for new parents
  • Fitness stipend and wellness program membership
  • Commuter benefits and transportation allowances
  • Daily catered lunches and fully stocked kitchens
  • Learning and development stipend for professional growth
  • Relocation assistance for new hires
  • Work with world-class researchers and engineers on frontier AI
  • Mission-driven culture focused on safe AGI development
  • Modern office in San Francisco with the latest hardware

