Software Engineer, Frontier Systems Careers at OpenAI - San Francisco, California | Apply Now!

OpenAI

Full-time | Posted: Feb 10, 2026

Job Description

Software Engineer, Frontier Systems at OpenAI - San Francisco, CA

Join OpenAI's Frontier Systems team and build the backbone of the world's most powerful AI supercomputers. This senior-level Software Engineer role in San Francisco offers a chance to own critical infrastructure that powers cutting-edge model training. If you have 7+ years in software engineering, expertise in Python, shell scripting, and data analysis tools like SQL and Pandas, apply now to make a real impact on humanity's AI future.

Role Overview

The Frontier Systems team at OpenAI is at the forefront of AI infrastructure innovation. We design, build, launch, and maintain the largest supercomputers on the planet, purpose-built for training OpenAI's frontier AI models. As a Software Engineer on this team, you'll transform data center blueprints into reliable, high-performance systems capable of running uninterrupted training runs for the most advanced AI systems.

Your work directly supports OpenAI's mission to develop safe artificial general intelligence (AGI) that benefits all of humanity. Even a single hardware failure can cost weeks of compute time and millions in resources, so reliability is paramount. Engineers here are trusted with full ownership—from diagnosing complex system issues to deploying automation that scales across thousands of GPUs and nodes.

This isn't just ops work; it's systems engineering at the bleeding edge. You'll dive deep into root causes, build tools that prevent failures proactively, and collaborate with top AI researchers to keep training pipelines humming 24/7. No prior hardware experience is required: we'll teach you the low-level details, such as PCIe protocols, InfiniBand networking, and kernel tuning, as you go.

Based in San Francisco, this role offers the excitement of working on infrastructure that pushes the boundaries of what's possible in AI compute.

Key Responsibilities

In this high-impact role, you'll take end-to-end ownership of the systems that power frontier model training. Here's what you'll do daily:

Own system health checks that ensure hyperscale supercomputers remain stable during multi-week training runs.
Lead investigations into hardware failures, analyzing petabytes of telemetry to pinpoint root causes.
Build Python-based automation to monitor and remediate issues across 10,000+ machines in real time.
Dig into noisy logs and metrics using SQL queries, PromQL, and Pandas for reproducible insights.
Develop shell scripts and tools for low-level hardware interactions, like power cycling nodes or diagnosing InfiniBand link flaps.
Optimize kernel parameters for peak GPU utilization and minimal latency in distributed training.
Create dashboards and visualizations for data center-wide health monitoring.
Collaborate with hardware vendors to resolve systemic issues at exascale.
Design failover logic to handle node failures without interrupting model training.
Scale monitoring systems as clusters grow from thousands to tens of thousands of GPUs.
Document failure modes and build preventive automation to eliminate them.
Support 24/7 on-call rotations with a focus on minimizing researcher downtime.
Contribute to open-source tools for large-scale systems management (where possible).

Expect to wear multiple hats: builder, detective, and optimizer—all while shipping code that runs the future of AI.

Qualifications

We're looking for senior engineers who thrive in ambiguity and scale. Required:

7+ years of software engineering experience in production environments.
Expert Python and shell scripting for automation and tooling.
Proven ability to wrangle noisy data with SQL, PromQL, Pandas, or similar.
Track record of building reproducible analyses that drive decisions.
Balanced skills in software development and operational reliability.

Bonus points for:

Hands-on experience with hardware internals (PCIe, InfiniBand, networking).
Data center visualization and monitoring expertise.
Network ops, power management, or Linux kernel tuning.

Prior supercomputing or AI infra experience is a plus, but not required. Strong systems thinkers from cloud, HPC, or large-scale web services will excel here.

Salary & Benefits

Compensation for senior Software Engineers in San Francisco ranges from $250,000 to $450,000 in base salary, plus equity and bonuses. OpenAI offers one of the best packages in tech:

Top-tier medical, dental, vision coverage.
401(k) with 4%+ match.
Unlimited vacation and flexible hours.
16+ weeks parental leave.
Wellness stipends and gym reimbursements.
Relocation support for moving to San Francisco.
Stock options in a unicorn shaping AGI.

Full details shared during interviews.

Why Join OpenAI?

OpenAI isn't just another tech company—we're building AGI to benefit humanity. Your code will run on supercomputers training models that could solve climate change, cure diseases, and accelerate scientific discovery. Join a mission-driven team of the world's best, with a culture that values impact over bureaucracy.

The San Francisco HQ offers vibrant collaboration, stocked kitchens, and events. We are an equal opportunity employer committed to diversity.

How to Apply

Submit your resume and a note on why you're excited about Frontier Systems. Interviews include technical deep dives, systems debugging, and team fit. No agencies, please.

Apply now and power the next era of AI!

Locations

San Francisco, California, United States

Salary

Estimated Salary Range (high confidence)

$262,500 - $495,000 USD per year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

Python programming: intermediate
Shell scripting: intermediate
SQL querying: intermediate
PromQL: intermediate
Pandas data analysis: intermediate
Linux systems administration: intermediate
Hardware troubleshooting: intermediate
System health monitoring: intermediate
Automation scripting: intermediate
Root cause analysis: intermediate
Data center operations: intermediate
Network protocols (PCIe, InfiniBand): intermediate
Power management: intermediate
Kernel performance tuning: intermediate
Reproducible analyses: intermediate
Large-scale infrastructure: intermediate
Visualization tools: intermediate
Network operations: intermediate
Hyperscale computing: intermediate
AI model training support: intermediate

Required Qualifications

7+ years of industry experience in software engineering
Proficiency with Python and shell scripting for automation
High comfort level digging into noisy data using SQL, PromQL, and Pandas
Experience developing reproducible analyses for system diagnostics
Strong balance of building software and operationalizing infrastructure
Ability to own end-to-end system health checks for hyperscale supercomputers
Expertise in leading deep dives into hardware failures and system-level bugs
Comfort with low-level hardware details like PCIe, InfiniBand, and networking (bonus)
Experience with Linux tooling for power management and kernel perf tuning (bonus)
Familiarity with visualization of large data centers and networks (bonus)
Proven track record in network operations and tooling (bonus)
Passion for stabilizing systems during cutting-edge AI model training

Responsibilities

Own and continuously improve system health checks for hyperscale supercomputers
Lead deep-dive investigations into hardware failures at massive scale
Perform root cause analysis on system-level bugs impacting model training
Build and deploy automation tools to monitor thousands of machines
Develop scripts to automatically detect and fix issues without human intervention
Analyze noisy telemetry data using SQL, PromQL, and Python Pandas
Create reproducible analyses to document and prevent recurring failures
Collaborate with researchers to minimize disruptions during frontier model training
Optimize power management and stabilization across data center infrastructure
Tune Linux kernel performance for high-efficiency supercomputing workloads
Visualize and monitor large-scale data center networks and hardware states
Design and implement failover mechanisms for critical training infrastructure
Support the launch and scaling of the world's largest AI supercomputers
Integrate hardware protocols like PCIe and InfiniBand into monitoring systems

Benefits

Comprehensive health, dental, and vision insurance plans
401(k) retirement savings with generous company matching
Unlimited PTO policy to promote work-life balance
Generous parental leave for new parents
Mental health support through partnered counseling services
Fitness reimbursement and wellness stipends
Fully stocked kitchens with healthy snacks and meals
Learning and development stipend for conferences and courses
Equity stock options in a high-growth AI company
Commuter benefits for San Francisco public transit
Relocation assistance for out-of-state candidates
Team offsites and social events to build camaraderie
Cutting-edge work on the world's largest supercomputers
Mission-driven culture focused on safe AGI for humanity
