
Software Engineer, Fleet Management

OpenAI


Full-time · Posted: Feb 10, 2026

Job Description

Software Engineer, Fleet Management at OpenAI - San Francisco, CA

Join OpenAI's Fleet Management team and build the world's most advanced AI infrastructure. As a Software Engineer specializing in Operating Systems & Orchestration, you'll manage massive GPU clusters powering ChatGPT and frontier AI research. This senior-level role offers hybrid work in San Francisco with relocation assistance.

Role Overview

The Fleet team at OpenAI powers the computing backbone for our groundbreaking AI research and products. We manage exascale infrastructure spanning thousands of GPUs across data centers worldwide, ensuring 99.99% availability for mission-critical AI workloads. Your work directly enables researchers to train next-generation models while maintaining safety and reliability at unprecedented scale.

In this role, you'll bridge low-level hardware systems with high-level orchestration, creating tools that integrate bare-metal servers into unified clusters. You'll leverage cutting-edge LLMs to automate vendor operations, optimize scheduling algorithms, and eliminate operational toil. This position requires deep systems expertise and passion for solving infrastructure challenges that don't exist anywhere else.

Based in San Francisco with a hybrid model (3 days/week in office), this role offers relocation support and exposure to the most advanced AI infrastructure on the planet. You'll collaborate with world-class researchers, hardware engineers, and infrastructure experts pushing the boundaries of what's possible in AI compute.

Key Responsibilities

Software Engineers on the Fleet team tackle complex challenges across the entire infrastructure stack:

  • Architect cluster management systems handling 100,000+ GPU nodes with heterogeneous hardware
  • Develop observability platforms integrating NVIDIA DCGM, Prometheus, and custom ML workload metrics
  • Build LLM agents automating hardware procurement, RMA processing, and vendor SLAs
  • Implement auto-scaling algorithms balancing research deadlines, cost, and energy efficiency
  • Create self-healing infrastructure detecting and resolving node failures in seconds
  • Optimize Linux kernel parameters for AI training workloads achieving 20%+ performance gains
  • Design firmware update orchestration minimizing downtime across global data centers
  • Build network topology discovery and routing optimization for InfiniBand/RoCE fabrics
  • Develop capacity forecasting models predicting compute needs 6+ months ahead
  • Automate security compliance across bare-metal and cloud-hybrid environments
  • Create researcher-facing portals for cluster utilization and job prioritization
  • Lead cross-team initiatives improving end-to-end research infrastructure velocity
  • Participate in a 24/7 on-call rotation supporting production AI services used by millions daily

Qualifications

We're looking for senior engineers with proven track records in large-scale infrastructure:

  • 7+ years of experience building and operating distributed systems at hyperscale
  • Deep expertise in Kubernetes/Slurm cluster management at 10,000+ node scale
  • Strong Linux systems programming including kernel tuning and eBPF development
  • Hands-on experience with bare-metal provisioning (MAAS, Tinkerbell, or equivalent)
  • Production experience managing GPU fleets (NVIDIA A100/H100 clusters preferred)
  • Proficiency in Infrastructure as Code (Terraform, Pulumi) and GitOps workflows
  • Experience building observability systems (Prometheus, Grafana, custom agents)
  • Familiarity with ML platform tools (Kubeflow, Ray, custom schedulers)
  • Strong Python/Go/C++ skills for systems-level tooling
  • Track record of automating complex operational workflows to zero-touch
  • Experience with high-performance networking (InfiniBand, 400G Ethernet)
  • BS/MS in Computer Science, Electrical Engineering, or equivalent experience

Salary & Benefits

Compensation Range: $220,000 - $380,000 USD base salary + equity + bonus (SF location). Total compensation includes comprehensive benefits package.

  • Industry-leading medical, dental, vision coverage (90%+ premiums covered)
  • 401(k) with 4%+ employer match, immediate vesting
  • Unlimited PTO + 16 weeks paid parental leave
  • Hybrid SF work model with catered meals daily
  • $10K+ annual learning stipend (conferences, courses, books)
  • Full relocation package including temporary housing
  • Generous equity grants with 4-year vesting
  • Fitness reimbursement, mental health support, commuter benefits

Why Join OpenAI?

OpenAI isn't just building AI products—we're creating the infrastructure that makes artificial general intelligence possible. Your work on the Fleet team directly enables:

  • Training runs consuming more compute than most countries' entire grids
  • Safety systems preventing catastrophic model failures at scale
  • Products serving 100M+ weekly users with 99.99% reliability
  • Research pushing fundamental limits of what AI can achieve

Join a team where engineers own entire surface areas—from kernel patches to global orchestration. Work with PhD researchers, hardware architects, and product leaders shaping AI's future. Our culture emphasizes impact, ownership, and rapid learning in humanity's most important technical challenge.

How to Apply

Ready to build the infrastructure powering AGI? Submit your resume and a brief note about your most impactful infrastructure project. We're particularly interested in:

  • Experience scaling GPU infrastructure for ML workloads
  • Custom tooling you've built for fleet management
  • Performance optimizations achieving measurable business impact

Application Process:

  1. 30-minute recruiter screen
  2. Technical deep-dive with engineering team
  3. Systems design interview (cluster management scenario)
  4. Take-home project (optional, by mutual agreement)
  5. Team fit conversations
  6. Offer!

We review applications on a rolling basis. SF-based candidates preferred due to hybrid requirements.

Locations

  • San Francisco, California, United States

Salary

Estimated Salary Range (high confidence)

231,000 - 418,000 USD / year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Kubernetes (intermediate)
  • Terraform (intermediate)
  • CI/CD Pipelines (intermediate)
  • Linux Kernels (intermediate)
  • Containerization (intermediate)
  • Chef Configuration Management (intermediate)
  • Firmware Management (intermediate)
  • Host Routing (intermediate)
  • GPU Cluster Management (intermediate)
  • Bare-Metal Infrastructure (intermediate)
  • Cloud Providers (AWS, GCP, Azure) (intermediate)
  • LLM Integration (intermediate)
  • Infrastructure Automation (intermediate)
  • Job Scheduling (intermediate)
  • Hardware Metrics Monitoring (intermediate)
  • Vendor Management Systems (intermediate)
  • Systems Programming (intermediate)
  • Network Orchestration (intermediate)
  • Performance Optimization (intermediate)
  • Reliability Engineering (intermediate)

Required Qualifications

  • Strong software engineering skills with 5+ years in large-scale infrastructure environments
  • Broad knowledge of cluster-level systems including Kubernetes, Slurm, or equivalent
  • Hands-on experience with CI/CD pipelines using Jenkins, GitHub Actions, or GitLab CI
  • Proficiency in Infrastructure as Code tools like Terraform, Ansible, or Puppet
  • Deep expertise in server-level systems including Linux kernel tuning and optimization
  • Experience with containerization technologies such as Docker and containerd
  • Familiarity with configuration management tools like Chef, SaltStack, or similar
  • Knowledge of firmware management and BIOS/UEFI configuration at scale
  • Strong understanding of host networking, routing protocols, and SDN
  • Experience managing bare-metal fleets and cloud-hybrid environments
  • Passion for optimizing GPU/CPU compute performance and reliability
  • Ability to thrive in fast-paced, dynamic research environments
  • Excellent collaboration skills with cross-functional engineering teams
  • Bachelor's degree in Computer Science, Electrical Engineering, or equivalent experience

Responsibilities

  • Design and implement systems to manage cloud and bare-metal fleets at exascale
  • Build tools integrating low-level hardware telemetry with cluster schedulers
  • Develop LLM-powered automation for vendor coordination and procurement workflows
  • Create observability platforms combining hardware metrics, job performance, and cost data
  • Automate node provisioning, configuration drift detection, and self-healing systems
  • Optimize cluster scheduling algorithms for AI/ML workloads and research pipelines
  • Implement firmware update orchestration across thousands of heterogeneous nodes
  • Design network fabric management for high-bandwidth GPU interconnects
  • Build capacity planning tools forecasting compute needs for model training
  • Develop security hardening automation for infrastructure components
  • Create dashboards and alerting systems for fleet health and performance
  • Collaborate with research teams to tune infrastructure for specific model requirements
  • Document systems architecture and operational procedures for team scalability
  • Participate in on-call rotation to maintain 99.99% cluster availability

Benefits

  • Competitive salary with annual performance bonuses
  • Comprehensive medical, dental, and vision insurance
  • 401(k) matching program with immediate vesting
  • Unlimited PTO with encouraged recharge periods
  • Hybrid work model (3 days/week in SF office)
  • Full relocation assistance including housing support
  • Generous parental leave (16 weeks fully paid)
  • Fitness reimbursement and wellness programs
  • Catered meals and fully stocked kitchens daily
  • Learning stipend for conferences and certifications
  • Equity grants with significant growth potential
  • Mental health support through dedicated programs
  • Commuter benefits and subsidized public transit
  • Volunteer time off and charitable matching
  • Cutting-edge hardware access for personal projects


