
Software Engineer, GPU Infrastructure - HPC

OpenAI

San Francisco, California

Full-time | Posted: Feb 10, 2026

Job Description

Software Engineer, GPU Infrastructure - HPC (San Francisco, California)

Join OpenAI's Fleet HPC team and become a critical guardian of the world's most advanced AI supercomputing infrastructure. As a Software Engineer, GPU Infrastructure - HPC, you'll ensure the reliability of massive GPU clusters powering groundbreaking research and products like ChatGPT. This senior-level role in San Francisco offers unparalleled impact on AI's future.

Role Overview

The Fleet team at OpenAI manages the computing backbone that drives our cutting-edge AI research and deployment. Spanning massive data centers filled with GPUs, high-speed networking, and custom infrastructure, our systems must deliver flawless performance at unprecedented scale. A single hardware failure can halt multi-billion-parameter model training or disrupt millions of ChatGPT users worldwide.

In this pivotal role, you'll pioneer solutions for state-of-the-art supercomputers that don't exist anywhere else. We're not just operating infrastructure—we're inventing the operational paradigms for AI at exascale. You'll enjoy high autonomy, deep ownership, and the ability to drive meaningful change across OpenAI's compute ecosystem.

Prior hardware expertise isn't required. We value engineers who excel at system-level thinking, automation, and turning chaos into reliable systems. If you thrive on comprehensive investigations, building at scale, and eliminating toil through code, this is your opportunity to shape AI infrastructure's frontier.

Key Responsibilities

Your day-to-day will focus on making OpenAI's compute fleet unbreakable:

  • Architect automation systems that provision and manage tens of thousands of GPU servers across global data centers
  • Build sophisticated monitoring solutions tracking server health, thermal performance, memory errors, and network fabric integrity
  • Lead cross-team collaborations with cluster orchestration, networking, and capacity planning groups
  • Work directly with colocation partners to enforce rigorous quality standards and rapid issue resolution
  • Profile and eliminate performance bottlenecks across the entire GPU stack—from PCIe lanes to inter-node InfiniBand fabrics
  • Engineer continuous improvements that slash manual operations by orders of magnitude
  • Perform exhaustive root cause analysis on fleet-wide anomalies using every tool in your arsenal
  • Develop proactive detection systems that predict failures before they cascade
  • Fine-tune Linux kernels and hardware firmware for optimal HPC workloads
  • Manage full hardware lifecycles from procurement through decommissioning at supercomputer scale
  • Integrate observability across Prometheus, Grafana, custom agents, and ML-powered anomaly detection
  • Troubleshoot bleeding-edge systems where you're writing the playbooks as you go

Every decision you make directly impacts OpenAI's ability to push AI boundaries responsibly and reliably.

Qualifications

We're looking for engineers who combine builder and operator DNA:

  • Proven experience managing production server fleets at significant scale (10k+ nodes preferred)
  • Demonstrated ability to build AND reliably operate mission-critical infrastructure
  • Strong proficiency in Python, Go, Rust, or similar for infrastructure automation
  • Deep Linux expertise including kernel debugging, networking stack, and system tuning
  • Experience wrangling messy telemetry data with SQL, PromQL, Pandas, or equivalent
  • Track record of deep-dive investigations that uncover root causes others miss
  • Passion for automation that scales linearly while humans scale logarithmically

Bonus points for: low-level hardware familiarity (PCIe, InfiniBand, IPMI/Redfish), HPC experience, distributed systems design, or monitoring tool mastery. But don't hesitate if your background is software-heavy—we teach the hardware side.

Salary & Benefits

Compensation Range: $250,000 - $450,000 USD base salary (San Francisco), plus competitive equity package. Total compensation calibrated to senior infrastructure engineering market rates for top HPC talent.

Comprehensive Benefits Package:

  • Market-leading medical, dental, vision coverage
  • 401(k) with generous matching
  • Unlimited PTO with recharge encouragement
  • Generous parental leave (primary + secondary)
  • Fertility assistance benefits
  • Mental health professional support
  • Professional growth stipend
  • Relocation support for SF move
  • Premium office perks and wellness reimbursement

OpenAI structures compensation to attract and retain exceptional talent building humanity's most important technology.

Why Join OpenAI?

OpenAI isn't just another tech company—we're creating artificial general intelligence to benefit all humanity. Your work directly enables:

  • Training next-generation foundation models that redefine what's possible
  • Serving hundreds of millions through products like ChatGPT
  • Safety-first deployment of increasingly powerful systems
  • Research pushing fundamental AI boundaries

Work with brilliant peers who value diverse perspectives. Enjoy high ownership, rapid iteration, and tangible impact. We're building infrastructure that will power AGI, and you can be at the forefront.

OpenAI is an equal opportunity employer committed to diversity across all dimensions.

How to Apply

Ready to safeguard the compute powering AGI? Submit your resume and a brief note about your most impactful infrastructure project. Tell us about a time you automated away months of manual work or solved a fleet-wide mystery. We're excited to hear from builder-operators passionate about reliable systems at scale.

Locations: San Francisco, CA (primary). Exceptional candidates may be considered for remote work within US time zones.

Applications are reviewed on a rolling basis. OpenAI provides reasonable accommodation during the application process.


Locations

  • San Francisco, California, United States


Skills Required

  • Python programming (intermediate)
  • Go programming (intermediate)
  • Linux systems administration (intermediate)
  • Networking protocols (intermediate)
  • Server hardware management (intermediate)
  • HPC infrastructure (intermediate)
  • GPU cluster management (intermediate)
  • Prometheus monitoring (intermediate)
  • Grafana dashboards (intermediate)
  • SQL querying (intermediate)
  • PromQL (intermediate)
  • Pandas data analysis (intermediate)
  • Automation scripting (intermediate)
  • Distributed systems (intermediate)
  • InfiniBand networking (intermediate)
  • PCIe protocols (intermediate)
  • IPMI hardware management (intermediate)
  • Redfish protocols (intermediate)
  • Kernel performance tuning (intermediate)
  • Power management systems (intermediate)

Required Qualifications

  • Experience managing large-scale server environments at production scale
  • Proven balance of building software tools and operationalizing infrastructure
  • Proficiency in Python, Go, or equivalent systems programming languages
  • Strong knowledge of Linux operating systems, networking, and server hardware
  • Comfortable analyzing noisy data using SQL, PromQL, Pandas, or similar tools
  • Ability to perform deep system-level investigations and root cause analysis
  • Experience developing automation for detection, alerting, and remediation
  • Familiarity with high-performance computing (HPC) environments
  • Understanding of hardware management protocols such as IPMI and Redfish
  • Knowledge of low-level hardware details including PCIe, InfiniBand, and power management
  • Experience with monitoring tools such as Prometheus and Grafana
  • Prior work with distributed systems or supercomputing infrastructure

Responsibilities

  • Build and maintain automation systems for provisioning thousands of servers across data centers
  • Develop comprehensive monitoring tools for server health, performance metrics, and lifecycle events
  • Collaborate closely with clusters, networking, and broader infrastructure engineering teams
  • Partner with external data center operators to maintain the highest quality standards and SLAs
  • Identify performance bottlenecks in GPU infrastructure and implement targeted optimizations
  • Continuously improve automation pipelines to eliminate manual intervention and toil
  • Conduct thorough investigations into hardware failures and system disruptions at scale
  • Design and deploy tools for proactive detection of potential fleet-wide issues
  • Optimize kernel parameters and hardware configurations for maximum HPC efficiency
  • Manage lifecycle events including hardware refreshes, expansions, and decommissioning
  • Integrate with observability stacks using Prometheus, Grafana, and custom alerting
  • Troubleshoot state-of-the-art supercomputing systems pioneering new technologies
  • Ensure high availability and reliability of compute fleets powering AI research and products

Benefits

  • Competitive salary with equity package in a high-growth AI company
  • Comprehensive medical, dental, and vision insurance coverage
  • 401(k) retirement plan with generous company matching
  • Unlimited PTO policy with encouragement to recharge
  • Flexible work hours and hybrid work options in San Francisco
  • Generous parental leave for primary and secondary caregivers
  • Fertility assistance and family planning benefits
  • Mental health support through professional counseling services
  • Professional development stipend for conferences and courses
  • Fully stocked office with premium snacks and meals
  • Gym membership reimbursement and wellness programs
  • Relocation assistance package for new hires
  • Cutting-edge hardware and tools for maximum productivity
  • Direct impact on world-changing AI research and deployment

