
Software Engineer, GPU Infrastructure - HPC

OpenAI

San Francisco, California

Full-time | Posted: Feb 10, 2026

Job Description

Software Engineer, GPU Infrastructure - HPC (San Francisco, California)

Join OpenAI's Fleet HPC team and become a critical guardian of the world's most advanced AI supercomputing infrastructure. As a Software Engineer, GPU Infrastructure - HPC, you'll ensure the reliability of massive GPU clusters powering groundbreaking research and products like ChatGPT. This senior-level role in San Francisco offers unparalleled impact on AI's future.

Role Overview

The Fleet team at OpenAI manages the computing backbone that drives our cutting-edge AI research and deployment. Spanning massive data centers filled with GPUs, high-speed networking, and custom infrastructure, our systems must deliver flawless performance at unprecedented scale. A single hardware failure can halt multi-billion-parameter model training or disrupt millions of ChatGPT users worldwide.

In this pivotal role, you'll pioneer solutions for state-of-the-art supercomputers that don't exist anywhere else. We're not just operating infrastructure—we're inventing the operational paradigms for AI at exascale. You'll enjoy high autonomy, deep ownership, and the ability to drive meaningful change across OpenAI's compute ecosystem.

Prior hardware expertise isn't required. We value engineers who excel at system-level thinking, automation, and turning chaos into reliable systems. If you thrive on comprehensive investigations, building at scale, and eliminating toil through code, this is your opportunity to shape AI infrastructure's frontier.

Key Responsibilities

Your day-to-day will focus on making OpenAI's compute fleet unbreakable:

  • Architect automation systems that provision and manage tens of thousands of GPU servers across global data centers
  • Build sophisticated monitoring solutions tracking server health, thermal performance, memory errors, and network fabric integrity
  • Lead cross-team collaborations with cluster orchestration, networking, and capacity planning groups
  • Work directly with colocation partners to enforce rigorous quality standards and rapid issue resolution
  • Profile and eliminate performance bottlenecks across the entire GPU stack—from PCIe lanes to inter-node InfiniBand fabrics
  • Engineer continuous improvements that slash manual operations by orders of magnitude
  • Perform exhaustive root cause analysis on fleet-wide anomalies using every tool in your arsenal
  • Develop proactive detection systems that predict failures before they cascade
  • Fine-tune Linux kernels and hardware firmware for optimal HPC workloads
  • Manage full hardware lifecycles from procurement through decommissioning at supercomputer scale
  • Integrate observability across Prometheus, Grafana, custom agents, and ML-powered anomaly detection
  • Troubleshoot bleeding-edge systems where you're writing the playbooks as you go

Every decision you make directly impacts OpenAI's ability to push AI boundaries responsibly and reliably.

Qualifications

We're looking for engineers who combine builder and operator DNA:

  • Proven experience managing production server fleets at significant scale (10k+ nodes preferred)
  • Demonstrated ability to build AND reliably operate mission-critical infrastructure
  • Strong proficiency in Python, Go, Rust, or similar for infrastructure automation
  • Deep Linux expertise including kernel debugging, networking stack, and system tuning
  • Experience wrangling messy telemetry data with SQL, PromQL, Pandas, or equivalent
  • Track record of deep-dive investigations that uncover root causes others miss
  • Passion for automation that scales linearly while humans scale logarithmically

Bonus points for: low-level hardware familiarity (PCIe, InfiniBand, IPMI/Redfish), HPC experience, distributed systems design, or monitoring tool mastery. But don't hesitate if your background is software-heavy—we teach the hardware side.

Salary & Benefits

Compensation Range: $250,000 - $450,000 USD base salary (San Francisco), plus competitive equity package. Total compensation calibrated to senior infrastructure engineering market rates for top HPC talent.

Comprehensive Benefits Package:

  • Market-leading medical, dental, vision coverage
  • 401(k) with generous matching
  • Unlimited PTO with recharge encouragement
  • Generous parental leave (primary + secondary)
  • Fertility assistance benefits
  • Mental health professional support
  • Professional growth stipend
  • Relocation support for SF move
  • Premium office perks and wellness reimbursement

OpenAI structures compensation to attract and retain exceptional talent building humanity's most important technology.

Why Join OpenAI?

OpenAI isn't just another tech company—we're creating artificial general intelligence to benefit all humanity. Your work directly enables:

  • Training next-generation foundation models that redefine what's possible
  • Serving hundreds of millions through products like ChatGPT
  • Safety-first deployment of increasingly powerful systems
  • Research pushing fundamental AI boundaries

Work with brilliant peers who value diverse perspectives. Enjoy high ownership, rapid iteration, and tangible impact. We're building infrastructure that will power AGI, and you can be at the forefront.

OpenAI is an equal opportunity employer committed to diversity across all dimensions.

How to Apply

Ready to safeguard the compute powering AGI? Submit your resume and a brief note about your most impactful infrastructure project. Tell us about a time you automated away months of manual work or solved a fleet-wide mystery. We're excited to hear from builder-operators passionate about reliable systems at scale.

Locations: San Francisco, CA (primary). Exceptional candidates may be considered for remote work within US time zones.

Applications are reviewed on a rolling basis. OpenAI provides reasonable accommodation during the application process.


Locations

  • San Francisco, California, United States


Skills Required

  • Python programming (intermediate)
  • Go programming (intermediate)
  • Linux systems administration (intermediate)
  • Networking protocols (intermediate)
  • Server hardware management (intermediate)
  • HPC infrastructure (intermediate)
  • GPU cluster management (intermediate)
  • Prometheus monitoring (intermediate)
  • Grafana dashboards (intermediate)
  • SQL querying (intermediate)
  • PromQL (intermediate)
  • Pandas data analysis (intermediate)
  • Automation scripting (intermediate)
  • Distributed systems (intermediate)
  • InfiniBand networking (intermediate)
  • PCIe protocols (intermediate)
  • IPMI hardware management (intermediate)
  • Redfish protocols (intermediate)
  • Kernel performance tuning (intermediate)
  • Power management systems (intermediate)

Required Qualifications

  • Experience managing large-scale server environments at production scale
  • Proven balance of building software tools and operationalizing infrastructure
  • Proficiency in Python, Go, or equivalent systems programming languages
  • Strong knowledge of Linux operating systems, networking, and server hardware
  • Comfortable analyzing noisy data using SQL, PromQL, Pandas, or similar tools
  • Ability to perform deep system-level investigations and root cause analysis
  • Experience developing automation for detection, alerting, and remediation
  • Familiarity with high-performance computing (HPC) environments
  • Understanding of hardware management protocols such as IPMI and Redfish
  • Knowledge of low-level hardware details including PCIe, InfiniBand, and power management
  • Experience with monitoring tools such as Prometheus and Grafana
  • Prior work with distributed systems or supercomputing infrastructure

Responsibilities

  • Build and maintain automation systems for provisioning thousands of servers across data centers
  • Develop comprehensive monitoring tools for server health, performance metrics, and lifecycle events
  • Collaborate closely with clusters, networking, and broader infrastructure engineering teams
  • Partner with external data center operators to maintain the highest quality standards and SLAs
  • Identify performance bottlenecks in GPU infrastructure and implement targeted optimizations
  • Continuously improve automation pipelines to eliminate manual intervention and toil
  • Conduct thorough investigations into hardware failures and system disruptions at scale
  • Design and deploy tools for proactive detection of potential fleet-wide issues
  • Optimize kernel parameters and hardware configurations for maximum HPC efficiency
  • Manage lifecycle events including hardware refreshes, expansions, and decommissioning
  • Integrate with observability stacks using Prometheus, Grafana, and custom alerting
  • Troubleshoot state-of-the-art supercomputing systems pioneering new technologies
  • Ensure high availability and reliability of compute fleets powering AI research and products

Benefits

  • Competitive salary with equity package in a high-growth AI company
  • Comprehensive medical, dental, and vision insurance coverage
  • 401(k) retirement plan with generous company matching
  • Unlimited PTO policy with encouragement to recharge
  • Flexible work hours and hybrid work options in San Francisco
  • Generous parental leave for primary and secondary caregivers
  • Fertility assistance and family planning benefits
  • Mental health support through professional counseling services
  • Professional development stipend for conferences and courses
  • Fully stocked office with premium snacks and meals
  • Gym membership reimbursement and wellness programs
  • Relocation assistance package for new hires
  • Cutting-edge hardware and tools for maximum productivity
  • Direct impact on world-changing AI research and deployment

