
Software Engineer, Frontier Clusters Infrastructure at OpenAI - San Francisco, California


Full-time | Posted: Feb 10, 2026

Job Description


Join OpenAI's Frontier Systems team and build the world's largest supercomputers powering cutting-edge AI model training. This senior-level role combines distributed systems engineering with hands-on infrastructure operations at massive scale.

Role Overview

The Frontier Clusters Infrastructure Software Engineer role at OpenAI represents a rare opportunity to work at the absolute forefront of AI infrastructure. Our Frontier Systems team doesn't just manage data centers—they transform ambitious designs into operational reality, supporting the hyperscale supercomputers that train OpenAI's most advanced models.

In this position, you'll operate next-generation compute clusters blending distributed systems expertise with practical infrastructure engineering. Expect to scale Kubernetes to unprecedented sizes, automate bare-metal deployments across thousands of nodes, and create software layers that abstract multi-data-center complexity for seamless AI training workloads.

This is hands-on systems engineering where reliability meets velocity. You'll fight fires during critical training runs, build automation that scales with our ambitions, and continuously push the boundaries of what's possible in large-scale computing infrastructure.

OpenAI's mission to develop safe AGI requires infrastructure that doesn't just work—it excels under extreme conditions. Your work directly enables breakthroughs in artificial general intelligence that will shape humanity's future.

Key Responsibilities

As a Software Engineer on Frontier Clusters Infrastructure, your impact will span the full stack from bare metal to application workloads:

  • Kubernetes at Massive Scale: Architect, deploy, and scale Kubernetes clusters serving millions of pods across multiple data centers, implementing advanced provisioning, auto-scaling, and lifecycle automation.
  • Bare-Metal Mastery: Own the complete node lifecycle from rack delivery through firmware flashing, OS installation, and Kubernetes integration, achieving deployment times measured in minutes rather than hours.
  • Software Abstractions: Build distributed systems that present unified interfaces across heterogeneous clusters, enabling training workloads to scale seamlessly regardless of underlying hardware topology.
  • Operational Excellence: Drive metrics like cluster recovery time (target: <15 minutes), firmware upgrade velocity, and hardware utilization rates through relentless automation and observability.
  • End-to-End Reliability: Integrate server BMCs, switch telemetry, power systems, and cooling infrastructure into comprehensive health monitoring platforms that predict failures before they impact training.
  • Observability Engineering: Design monitoring systems capable of handling millions of metrics per second, implementing anomaly detection, capacity forecasting, and automated remediation at hyperscale.
  • Cross-Team Collaboration: Partner with AI researchers, hardware engineers, and networking teams to deliver infrastructure that meets the uncompromising demands of frontier model training.

Success requires balancing careful systems design with the urgency of production training runs—where minutes of downtime cost millions in compute and delay humanity's progress toward safe AGI.

Qualifications

We're seeking senior engineers who thrive in ambiguous, high-stakes environments:

  • 5+ years operating Kubernetes or similar orchestration at significant scale (1000+ nodes)
  • Deep expertise in Linux systems, networking protocols, and GPU computing stacks
  • Strong programming background (Python/Go preferred) with production Infrastructure-as-Code experience
  • Proven ability to build automation that eliminates toil across large engineering teams
  • Experience with bare-metal provisioning, firmware management, or data center operations
  • Track record of improving operational metrics in mission-critical environments

Bonus points for: High-performance computing experience, RDMA networking, cluster federation patterns, or prior work with AI training infrastructure.

Salary & Benefits

Compensation: $250,000 - $450,000 base salary + equity + comprehensive benefits (total compensation varies based on experience and location).

Exceptional Benefits Package:

  • Comprehensive medical, dental, vision coverage (100% premiums covered)
  • Unlimited PTO with wellness days
  • Generous parental leave (16+ weeks)
  • 401(k) with company match
  • Fitness stipend + wellness programs
  • Daily catered meals and fully stocked kitchens
  • Learning stipend for conferences and courses
  • Relocation support for SF move

OpenAI offers equity that could be life-changing as we approach AGI milestones.

Why Join OpenAI?

You'll work with brilliant minds pushing AI boundaries while building infrastructure that scales to meet them. OpenAI offers:

  • Mission Impact: Direct contribution to safe AGI development benefiting humanity
  • Technical Challenge: Largest supercomputers in the world with bleeding-edge hardware
  • Career Growth: Rapid learning environment with world-class peers
  • Culture: Mission-driven teams valuing curiosity, urgency, and impact

San Francisco headquarters provides access to top AI talent and hardware innovation ecosystem.

How to Apply

Ready to build the infrastructure powering AGI? Submit your resume and a brief note about your most impactful infrastructure project. We're moving quickly—top candidates interview within 1 week.

Application Link: [Apply Now]

OpenAI is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Locations

  • San Francisco, California, United States

Salary

Estimated Salary Range (high confidence)

262,500 - 495,000 USD / year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Kubernetes cluster scaling (intermediate)
  • Distributed systems engineering (intermediate)
  • Bare-metal provisioning (intermediate)
  • Infrastructure as Code (IaC) (intermediate)
  • Python programming (intermediate)
  • Go programming (intermediate)
  • Terraform automation (intermediate)
  • Linux systems administration (intermediate)
  • GPU workload management (intermediate)
  • Firmware upgrades (intermediate)
  • Networking integration (intermediate)
  • Observability systems (intermediate)
  • Cluster lifecycle management (intermediate)
  • High-performance computing (HPC) (intermediate)
  • Container orchestration (intermediate)
  • Automation scripting (intermediate)
  • Data center operations (intermediate)
  • Monitoring tools (Prometheus) (intermediate)
  • CI/CD pipelines (intermediate)
  • Hyperscale infrastructure (intermediate)

Required Qualifications

  • 5+ years of experience as an infrastructure, systems, or distributed systems engineer in large-scale environments
  • Deep knowledge of Kubernetes internals, scaling patterns, and containerized workloads
  • Proficiency in cloud infrastructure concepts including compute, networking, storage, and security
  • Strong programming skills in Python, Go, or similar languages
  • Hands-on experience with Infrastructure-as-Code tools like Terraform or CloudFormation
  • Comfortable operating bare-metal Linux environments and GPU hardware
  • Experience with large-scale networking and data center infrastructure
  • Proven track record of building automation to eliminate manual operations
  • Ability to diagnose and resolve issues in fast-moving, mission-critical systems
  • Familiarity with monitoring and observability systems for extreme workloads
  • Bonus: background in GPU workloads, firmware management, or high-performance computing

Responsibilities

  • Spin up and scale massive Kubernetes clusters with automation for provisioning and bootstrapping
  • Develop software abstractions unifying multiple clusters for seamless training workloads
  • Own complete node bring-up process from bare metal through firmware upgrades at scale
  • Optimize operational metrics like reducing cluster restart times from hours to minutes
  • Accelerate firmware and OS upgrade cycles through automation and repeatable processes
  • Integrate networking systems with hardware health monitoring for end-to-end reliability
  • Build comprehensive monitoring and observability systems for early issue detection
  • Maintain hyperscale supercomputers during frontier model training runs
  • Diagnose and resolve production issues during high-pressure training operations
  • Collaborate with research teams to ensure compute infrastructure meets AI training needs
  • Implement cluster lifecycle management including scaling, upgrades, and decommissioning
  • Continuously improve automation to eliminate toil and manual intervention
  • Work cross-functionally with hardware, networking, and software engineering teams

Benefits

  • Competitive salary with equity package in a high-growth AI company
  • Comprehensive medical, dental, and vision insurance (premiums fully covered)
  • 401(k) matching program for retirement savings
  • Unlimited PTO with encouraged recharge periods
  • Generous parental leave policies for new parents
  • Fitness stipend and wellness program membership
  • Commuter benefits and transportation allowances
  • Daily catered lunches and fully stocked kitchens
  • Learning and development stipend for professional growth
  • Relocation assistance for new hires
  • Work with world-class researchers and engineers on frontier AI
  • Mission-driven culture focused on safe AGI development
  • Modern office in San Francisco with the latest hardware

