
Software Engineer, Distributed Systems - OpenAI, San Francisco, California

OpenAI

Full-time | Posted: Feb 10, 2026

Job Description

Software Engineer, Distributed Systems at OpenAI - San Francisco, CA

Join OpenAI's Compute Runtime team building the foundational infrastructure powering the world's most advanced ML training systems. As a Software Engineer specializing in Distributed Systems, you'll architect the low-level frameworks that orchestrate thousands of computers moving petabytes of training data across our cutting-edge supercomputers.

This hybrid role in San Francisco offers relocation assistance and combines the excitement of AGI research with hands-on systems engineering at exascale. If you thrive on optimizing end-to-end performance while maintaining developer-friendly APIs, this is your opportunity to accelerate humanity's progress toward artificial general intelligence.

Key Responsibilities

  • Architect powerful, introspectable APIs that orchestrate thousands of GPUs and CPUs for seamless data movement and persistence across distributed clusters
  • Profile and optimize high-performance I/O pipelines to maximize local storage throughput and network fabric utilization
  • Design scalable compute runtime components that support dynamic ML training workloads, from research prototypes to production supercomputers
  • Rapidly deploy training frameworks to bleeding-edge hardware architectures in response to evolving model requirements
  • Build robust Python and Rust components that remain stable when scaling from hundreds to tens of thousands of nodes
  • Collaborate with ML researchers to understand data movement patterns and optimize for researcher productivity
  • Implement observability and monitoring systems that provide real-time insight into cluster-wide performance bottlenecks
  • Create fault-tolerant designs that automatically recover from hardware failures across massive distributed systems (see the illustrative sketch after this list)
  • Optimize the end-to-end training stack to minimize job completion times while maintaining system reliability
  • Develop debugging tools that enable rapid iteration cycles even at supercomputer scale
  • Contribute to open-source components that advance the state of distributed ML infrastructure
  • Mentor junior engineers while leading critical infrastructure initiatives
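
The responsibilities above are intentionally high level. Purely as an illustration of the kind of orchestration and fault-tolerance work they describe (this is not OpenAI's runtime or API; copy_shard, the failure rate, and the retry policy below are hypothetical stand-ins), a minimal Python sketch might fan shard transfers out across a worker pool, retry transient failures, and expose a small introspectable stats object:

```python
import concurrent.futures
import random
import time
from dataclasses import dataclass


@dataclass
class TransferStats:
    """Counters exposed for quick introspection of a transfer batch."""
    completed: int = 0
    retried: int = 0
    failed: int = 0


def copy_shard(shard_id: int) -> int:
    """Hypothetical stand-in for moving one data shard between nodes."""
    if random.random() < 0.2:  # simulate a transient storage/network failure
        raise IOError(f"transient failure on shard {shard_id}")
    time.sleep(0.01)  # simulate I/O latency
    return shard_id


def run_transfers(shard_ids, max_workers: int = 8, max_retries: int = 3) -> TransferStats:
    """Fan shard copies out across a thread pool, retrying transient failures."""
    stats = TransferStats()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(copy_shard, s): (s, 1) for s in shard_ids}
        while pending:
            done, _ = concurrent.futures.wait(
                pending, return_when=concurrent.futures.FIRST_COMPLETED
            )
            for fut in done:
                shard, attempt = pending.pop(fut)
                if fut.exception() is None:
                    stats.completed += 1
                elif attempt < max_retries:
                    stats.retried += 1
                    pending[pool.submit(copy_shard, shard)] = (shard, attempt + 1)
                else:
                    stats.failed += 1
    return stats


if __name__ == "__main__":
    print(run_transfers(range(100)))
```

In a real system the worker pool would span machines and the stats would feed cluster-wide monitoring rather than a print statement, but the shape of the problem (orchestration, retries, introspection) is the same.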

Qualifications & Requirements

Technical Expertise:

  • 5+ years of experience building production distributed systems at massive scale
  • Deep proficiency in Python and Rust (or C++/Go) for systems programming
  • Proven track record of profiling and optimizing I/O-bound and compute-bound workloads (a toy profiling illustration follows this list)
  • Hands-on experience deploying to GPU clusters or supercomputing environments
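
As a purely hypothetical illustration of the profiling requirement above (nothing here comes from the posting; both pipeline stages are toy stand-ins), a first-pass cProfile session like the one below shows whether a pipeline spends its time waiting on I/O or doing CPU work; real training pipelines would be analyzed with cluster- and GPU-level tooling on top of this kind of baseline:

```python
import cProfile
import io
import pstats
import time


def io_bound_step(n_reads: int) -> int:
    """Hypothetical stand-in for an I/O-heavy stage (e.g. reading shards from storage)."""
    total = 0
    for _ in range(n_reads):
        time.sleep(0.001)  # each "read" just waits, so time accrues in sleep
        total += 1
    return total


def compute_bound_step(n: int) -> int:
    """Hypothetical stand-in for a CPU-heavy stage (e.g. decoding or checksumming)."""
    return sum(i * i for i in range(n))


def pipeline() -> None:
    io_bound_step(200)
    compute_bound_step(2_000_000)


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    pipeline()
    profiler.disable()
    out = io.StringIO()
    # Sorting by cumulative time shows which stage dominates end-to-end runtime.
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    print(out.getvalue())
```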

Mindset & Approach:

  • Obsessive focus on end-to-end system performance and developer experience
  • Comfort with ambiguity, rapidly changing requirements, and evolving hardware stacks
  • Natural instinct for simplifying complex systems without sacrificing capability
  • Strong ownership mentality for delivering reliable systems at exascale

Bonus Points:

  • Experience with ML training infrastructure (PyTorch, JAX, etc.)
  • Supercomputing center experience (LLNL, NERSC, etc.)
  • Contributions to distributed systems frameworks (Ray, Dask, etc.)

Salary & Benefits

Compensation: $250,000 - $450,000 base + equity + benefits (San Francisco market rate)

Comprehensive Benefits Package:

  • Top-tier medical, dental, vision coverage with low employee premiums
  • 401(k) with 4%+ employer match
  • Unlimited PTO with wellness days
  • 16 weeks paid parental leave (primary/secondary caregivers)
  • $2,000+ annual learning stipend
  • Fitness reimbursement up to $100/month
  • Daily catered meals in SF office
  • Relocation package including housing and moving support
  • Commuter benefits and subsidized housing options

Hybrid schedule: 3 days/week in our state-of-the-art San Francisco headquarters.

Why Join OpenAI?

OpenAI isn't just building AI products—we're creating the infrastructure that makes AGI possible. Our Compute Runtime team sits at the foundation, enabling researchers to push boundaries without infrastructure bottlenecks.

Impact at Scale: Your optimizations directly accelerate humanity's path to AGI

Cutting-Edge Challenges: Work with hardware and scale most engineers never see

Research Synergy: Partner directly with world-class ML researchers

Team Culture: Collaborative, mission-driven engineers who move fast and ship reliable systems

How to Apply

Submit your resume and a brief note explaining why you're excited about distributed systems at OpenAI scale. Tell us about your favorite optimization you've implemented and the impact it had.

Timeline: Rolling applications with interviews typically within 1-2 weeks

Process: Technical screen → Systems coding → Distributed systems deep dive → Team match

Apply Now

Locations

  • San Francisco, California, United States

Salary

Estimated Salary Range (high confidence)

262,500 - 495,000 USD / year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Distributed Systems (intermediate)
  • Python Programming (intermediate)
  • Rust Programming (intermediate)
  • High-Performance Computing (intermediate)
  • System Optimization (intermediate)
  • Profiling Tools (intermediate)
  • Scalable APIs (intermediate)
  • Data Persistence (intermediate)
  • I/O Optimization (intermediate)
  • ML Training Frameworks (intermediate)
  • Supercomputer Deployment (intermediate)
  • Software Engineering (intermediate)
  • Performance Debugging (intermediate)
  • Large-Scale Systems (intermediate)
  • Hybrid Cloud Infrastructure (intermediate)
  • Real-Time Monitoring (intermediate)
  • Fault-Tolerant Design (intermediate)
  • Concurrency Programming (intermediate)

Required Qualifications

  • Experience building large-scale distributed systems handling thousands of nodes
  • Proficiency in Python and Rust, or equivalent high-performance languages
  • Strong background in profiling and optimizing compute and data pipelines for scale
  • Hands-on experience with high-performance I/O systems and network optimization
  • Demonstrated ability to design introspectable APIs for fast debugging and development
  • Track record of deploying software to supercomputers or massive GPU clusters
  • Passion for end-to-end system optimization that minimizes complexity and maintenance
  • Experience working in fast-paced environments with rapidly evolving requirements
  • Deep understanding of ML training system architectures and data movement patterns
  • Strong problem-solving skills for maintaining stability at exascale
  • Familiarity with hybrid work models and ability to collaborate in the office 3 days/week
  • Bachelor's or higher degree in Computer Science, Engineering, or a related field preferred

Responsibilities

  • Develop powerful APIs orchestrating thousands of computers for data movement and persistence
  • Design easy-to-use, introspectable systems promoting fast debugging and development cycles
  • Profile and optimize compute and data capabilities across local and distributed environments
  • Deploy training frameworks to the latest supercomputers in response to evolving ML needs
  • Work across the Python and Rust stack to build robust, scalable framework components
  • Maximize researcher productivity through high-performance system components
  • Optimize end-to-end systems, from high-performance I/O to supercomputer-scale distribution
  • Maintain system stability and performance when scaling to the newest hardware architectures
  • Collaborate with ML researchers to understand and address dynamic training requirements
  • Implement fault-tolerant designs ensuring reliability across massive distributed clusters
  • Create monitoring and observability tools for real-time system performance insights
  • Contribute to low-level framework components powering OpenAI's AGI training infrastructure
  • Rapidly iterate on systems based on hardware advancements and workload changes

Benefits

  • Competitive salary with equity package at a leading AI company
  • Comprehensive health, dental, and vision insurance coverage
  • 401(k) retirement plan with generous company matching
  • Relocation assistance for new employees moving to San Francisco
  • Hybrid work model: 3 days in office, 2 days of remote flexibility
  • Unlimited PTO with encouragement to disconnect and recharge
  • Generous parental leave policies for primary and secondary caregivers
  • Fitness reimbursement and wellness program benefits
  • Catered lunches and dinners daily in the San Francisco office
  • Learning stipend for conferences, courses, and professional development
  • Mental health support through a comprehensive EAP program
  • Commuter benefits and subsidized public transportation passes
  • Team offsites and social events throughout the year
  • Cutting-edge hardware access, including the latest GPU supercomputers
