
Software Engineer, Fleet Management

OpenAI


Full-time · Posted: Feb 10, 2026

Job Description

Software Engineer, Fleet Management at OpenAI - San Francisco, CA

Join OpenAI's Fleet Management team and build the world's most advanced AI infrastructure. As a Software Engineer specializing in Operating Systems & Orchestration, you'll manage massive GPU clusters powering ChatGPT and frontier AI research. This senior-level role offers hybrid work in San Francisco with relocation assistance.

Role Overview

The Fleet team at OpenAI powers the computing backbone for our groundbreaking AI research and products. We manage exascale infrastructure spanning thousands of GPUs across data centers worldwide, ensuring 99.99% availability for mission-critical AI workloads. Your work directly enables researchers to train next-generation models while maintaining safety and reliability at unprecedented scale.

In this role, you'll bridge low-level hardware systems with high-level orchestration, creating tools that integrate bare-metal servers into unified clusters. You'll leverage cutting-edge LLMs to automate vendor operations, optimize scheduling algorithms, and eliminate operational toil. This position requires deep systems expertise and passion for solving infrastructure challenges that don't exist anywhere else.

Based in San Francisco with a hybrid model (3 days/week in office), this role offers relocation support and exposure to the most advanced AI infrastructure on the planet. You'll collaborate with world-class researchers, hardware engineers, and infrastructure experts pushing the boundaries of what's possible in AI compute.

Key Responsibilities

Software Engineers on the Fleet team tackle complex challenges across the entire infrastructure stack:

  • Architect cluster management systems handling 100,000+ GPU nodes with heterogeneous hardware
  • Develop observability platforms integrating NVIDIA DCGM, Prometheus, and custom ML workload metrics
  • Build LLM agents automating hardware procurement, RMA processing, and vendor SLAs
  • Implement auto-scaling algorithms balancing research deadlines, cost, and energy efficiency
  • Create self-healing infrastructure detecting and resolving node failures in seconds
  • Optimize Linux kernel parameters for AI training workloads achieving 20%+ performance gains
  • Design firmware update orchestration minimizing downtime across global data centers
  • Build network topology discovery and routing optimization for InfiniBand/RoCE fabrics
  • Develop capacity forecasting models predicting compute needs 6+ months ahead
  • Automate security compliance across bare-metal and cloud-hybrid environments
  • Create researcher-facing portals for cluster utilization and job prioritization
  • Lead cross-team initiatives improving end-to-end research infrastructure velocity
  • Participate in a 24/7 on-call rotation supporting production AI services used by millions daily

Qualifications

We're looking for senior engineers with proven track records in large-scale infrastructure:

  • 7+ years of experience building and operating distributed systems at hyperscale
  • Deep expertise in Kubernetes/Slurm cluster management at 10,000+ node scale
  • Strong Linux systems programming including kernel tuning and eBPF development
  • Hands-on experience with bare-metal provisioning (MAAS, Tinkerbell, or equivalent)
  • Production experience managing GPU fleets (NVIDIA A100/H100 clusters preferred)
  • Proficiency in Infrastructure as Code (Terraform, Pulumi) and GitOps workflows
  • Experience building observability systems (Prometheus, Grafana, custom agents)
  • Familiarity with ML platform tools (Kubeflow, Ray, custom schedulers)
  • Strong Python/Go/C++ skills for systems-level tooling
  • Track record of automating complex operational workflows to zero-touch
  • Experience with high-performance networking (InfiniBand, 400G Ethernet)
  • BS/MS in Computer Science, Electrical Engineering, or equivalent experience

Salary & Benefits

Compensation Range: $220,000 - $380,000 USD base salary + equity + bonus (SF location). Total compensation includes comprehensive benefits package.

  • Industry-leading medical, dental, vision coverage (90%+ premiums covered)
  • 401(k) with 4%+ employer match, immediate vesting
  • Unlimited PTO + 16 weeks paid parental leave
  • Hybrid SF work model with catered meals daily
  • $10K+ annual learning stipend (conferences, courses, books)
  • Full relocation package including temporary housing
  • Generous equity grants with 4-year vesting
  • Fitness reimbursement, mental health support, commuter benefits

Why Join OpenAI?

OpenAI isn't just building AI products—we're creating the infrastructure that makes artificial general intelligence possible. Your work on the Fleet team directly enables:

  • Training runs consuming more compute than most countries' entire grids
  • Safety systems preventing catastrophic model failures at scale
  • Products serving 100M+ weekly users with 99.99% reliability
  • Research pushing fundamental limits of what AI can achieve

Join a team where engineers own entire surface areas—from kernel patches to global orchestration. Work with PhD researchers, hardware architects, and product leaders shaping AI's future. Our culture emphasizes impact, ownership, and rapid learning in humanity's most important technical challenge.

How to Apply

Ready to build the infrastructure powering AGI? Submit your resume and a brief note about your most impactful infrastructure project. We're particularly interested in:

  • Experience scaling GPU infrastructure for ML workloads
  • Custom tooling you've built for fleet management
  • Performance optimizations achieving measurable business impact

Application Process:

  1. 30-minute recruiter screen
  2. Technical deep-dive with engineering team
  3. Systems design interview (cluster management scenario)
  4. Take-home project (optional, by mutual agreement)
  5. Team fit conversations
  6. Offer!

We review applications on a rolling basis. SF-based candidates preferred due to hybrid requirements.

Locations

  • San Francisco, California, United States

Salary

Estimated Salary Range (high confidence)

231,000 - 418,000 USD / year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Kubernetes (intermediate)
  • Terraform (intermediate)
  • CI/CD Pipelines (intermediate)
  • Linux Kernels (intermediate)
  • Containerization (intermediate)
  • Chef Configuration Management (intermediate)
  • Firmware Management (intermediate)
  • Host Routing (intermediate)
  • GPU Cluster Management (intermediate)
  • Bare-Metal Infrastructure (intermediate)
  • Cloud Providers (AWS, GCP, Azure) (intermediate)
  • LLM Integration (intermediate)
  • Infrastructure Automation (intermediate)
  • Job Scheduling (intermediate)
  • Hardware Metrics Monitoring (intermediate)
  • Vendor Management Systems (intermediate)
  • Systems Programming (intermediate)
  • Network Orchestration (intermediate)
  • Performance Optimization (intermediate)
  • Reliability Engineering (intermediate)

Required Qualifications

  • Strong software engineering skills with 5+ years in large-scale infrastructure environments
  • Broad knowledge of cluster-level systems including Kubernetes, Slurm, or equivalent
  • Hands-on experience with CI/CD pipelines using Jenkins, GitHub Actions, or GitLab CI
  • Proficiency in Infrastructure as Code tools like Terraform, Ansible, or Puppet
  • Deep expertise in server-level systems including Linux kernel tuning and optimization
  • Experience with containerization technologies such as Docker and containerd
  • Familiarity with configuration management tools like Chef, SaltStack, or similar
  • Knowledge of firmware management and BIOS/UEFI configuration at scale
  • Strong understanding of host networking, routing protocols, and SDN
  • Experience managing bare-metal fleets and cloud-hybrid environments
  • Passion for optimizing GPU/CPU compute performance and reliability
  • Ability to thrive in fast-paced, dynamic research environments
  • Excellent collaboration skills with cross-functional engineering teams
  • Bachelor's degree in Computer Science, Electrical Engineering, or equivalent experience

Responsibilities

  • Design and implement systems to manage cloud and bare-metal fleets at exascale
  • Build tools integrating low-level hardware telemetry with cluster schedulers
  • Develop LLM-powered automation for vendor coordination and procurement workflows
  • Create observability platforms combining hardware metrics, job performance, and cost data
  • Automate node provisioning, configuration drift detection, and self-healing systems
  • Optimize cluster scheduling algorithms for AI/ML workloads and research pipelines
  • Implement firmware update orchestration across thousands of heterogeneous nodes
  • Design network fabric management for high-bandwidth GPU interconnects
  • Build capacity planning tools forecasting compute needs for model training
  • Develop security hardening automation for infrastructure components
  • Create dashboards and alerting systems for fleet health and performance
  • Collaborate with research teams to tune infrastructure for specific model requirements
  • Document systems architecture and operational procedures for team scalability
  • Participate in on-call rotation to maintain 99.99% cluster availability

Benefits

  • Competitive salary with annual performance bonuses
  • Comprehensive medical, dental, and vision insurance
  • 401(k) matching program with immediate vesting
  • Unlimited PTO with encouraged recharge periods
  • Hybrid work model (3 days/week in SF office)
  • Full relocation assistance including housing support
  • Generous parental leave (16 weeks fully paid)
  • Fitness reimbursement and wellness programs
  • Catered meals and fully stocked kitchens daily
  • Learning stipend for conferences and certifications
  • Equity grants with significant growth potential
  • Mental health support through dedicated programs
  • Commuter benefits and subsidized public transit
  • Volunteer time off and charitable matching
  • Cutting-edge hardware access for personal projects


