Staff Site Reliability Engineer, Managed AI Careers at Crusoe - San Francisco, California | Apply Now!

Crusoe

Full-time | Posted: Jan 24, 2026

Job Description

Staff Site Reliability Engineer, Managed AI - Crusoe Energy Systems

About Crusoe

Crusoe's mission is to align the future of energy with the future of computing. We build innovative technologies that reduce both the costs and the environmental impact of the world’s expanding digital infrastructure. By creating mutually beneficial relationships between the energy and digital sectors, we unlock stranded energy resources, lower the costs of computation, and pave the way for a more sustainable and prosperous future.

Our focus is on providing cloud computing solutions powered by otherwise wasted energy sources, reducing flaring and emissions while simultaneously supporting compute-intensive workloads such as AI and machine learning. We are a team of innovators, problem-solvers, and visionaries dedicated to making a tangible difference in the world. Join us on our journey to accelerate the abundance of energy and intelligence in a sustainable manner.

Role Overview: Staff Site Reliability Engineer, Managed AI

As a Staff Site Reliability Engineer (SRE) specializing in Managed AI at Crusoe, you will play a critical role in ensuring the reliability, scalability, and performance of our AI-optimized cloud platform. You will be responsible for designing, building, and operating the infrastructure that powers large language models (LLMs) and other AI services at scale. This role requires a deep understanding of distributed systems, a passion for automation, and a commitment to delivering exceptional service to our customers.

You will collaborate closely with AI, platform, and infrastructure teams to optimize the entire AI pipeline, from training to inference. Your contributions will directly impact the efficiency and effectiveness of our AI services, enabling our customers to push the boundaries of innovation in various industries. You will be a key member of a dynamic team that is at the forefront of sustainable and transformative cloud infrastructure.

A Day in the Life of a Staff Site Reliability Engineer

Here’s a glimpse into what your day-to-day might look like:

Morning: Start your day by reviewing monitoring dashboards and alerts to identify any potential issues or performance bottlenecks in the AI infrastructure.
Mid-day: Participate in a team meeting to discuss ongoing projects, share updates, and brainstorm solutions to complex challenges.
Afternoon: Work on automating the deployment and scaling of LLM inference services using Kubernetes and other container orchestration tools.
Evening: Investigate a reported performance issue in a distributed AI system, using telemetry, logs, and profiling tools to identify the root cause and implement a fix.
Throughout the Day: Collaborate with AI researchers and engineers to optimize the performance of their models and ensure seamless integration with the underlying infrastructure.

Why San Francisco, CA?

San Francisco is a hub of technological innovation and a vibrant ecosystem for AI and machine learning. Close to Silicon Valley, the city offers strong opportunities for professional growth and networking. By working in our San Francisco office, you will be surrounded by some of the brightest minds in the industry, with access to cutting-edge research and development.

Beyond the professional advantages, San Francisco boasts a rich cultural scene, diverse neighborhoods, and stunning natural beauty. From the Golden Gate Bridge to the vibrant arts and culinary scene, there’s always something new to explore. The city's commitment to sustainability also aligns perfectly with Crusoe's mission, making it an ideal place to live and work.

Career Path at Crusoe

At Crusoe, we are committed to fostering the growth and development of our employees. As a Staff Site Reliability Engineer, you will have opportunities to advance your career through:

Technical Leadership: Lead complex projects, mentor junior engineers, and contribute to the architectural vision of our AI infrastructure.
Specialization: Deepen your expertise in specific areas of SRE, such as performance engineering, capacity planning, or security.
Management: Transition into a management role, leading a team of SREs and overseeing the operation of critical AI services.
Cross-Functional Opportunities: Explore opportunities to work on other teams within Crusoe, such as AI research, platform engineering, or infrastructure development.

Salary & Benefits

Crusoe offers a competitive salary and comprehensive benefits package that includes:

Competitive Pay: We offer salaries that are competitive with top tech companies in the San Francisco area.
Restricted Stock Units: Employees receive restricted stock units, providing an opportunity to share in the company's success.
Health Insurance: Comprehensive health, dental, and vision insurance plans are available.
HSA Contributions: Employer contributions to Health Savings Accounts (HSA).
Paid Parental Leave: Generous paid parental leave for new parents.
Life and Disability Insurance: Company-paid life insurance, plus short-term and long-term disability coverage.
Teladoc: Access to virtual healthcare services through Teladoc.
401(k): A 401(k) plan with a 100% match up to 4% of your salary.
Paid Time Off: Generous paid time off and a comprehensive holiday schedule.
Additional Perks: Cell phone reimbursement, tuition reimbursement, Calm app subscription, MetLife Legal services, and a company-paid commuter benefit.

Crusoe Culture

Our culture is built on a foundation of innovation, collaboration, and sustainability. We value diversity, creativity, and a passion for making a difference. At Crusoe, you will be part of a team that is committed to:

Continuous Learning: We encourage employees to stay up-to-date with the latest technologies and trends in AI and cloud computing.
Open Communication: We foster a culture of open communication and transparency, where everyone feels comfortable sharing their ideas and feedback.
Work-Life Balance: We recognize the importance of work-life balance and offer flexible work arrangements to support our employees' needs.
Impactful Work: We provide opportunities to work on projects that have a real-world impact, contributing to a more sustainable and prosperous future.

How to Apply

If you are passionate about AI, distributed systems, and sustainability, and you are looking for a challenging and rewarding career, we encourage you to apply for the Staff Site Reliability Engineer, Managed AI position at Crusoe. To apply, please submit your resume and a cover letter outlining your qualifications and experience through our online application portal.

FAQ

What is Crusoe's mission?

Crusoe's mission is to align the future of energy with the future of computing by building innovative technologies that reduce both the costs and environmental impact of expanding digital infrastructure.

What is Site Reliability Engineering (SRE)?

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure operations. The goal of SRE is to automate operational tasks, improve system reliability, and ensure scalability and performance.

What are Large Language Models (LLMs)?

Large language models (LLMs) are a type of artificial intelligence model that is trained on vast amounts of text data to understand and generate human-like text. They are used in a variety of applications, including chatbots, content creation, and machine translation.

What is Kubernetes?

Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. It is widely used in cloud-native environments to manage complex workloads.

What kind of experience is Crusoe looking for in a Staff SRE?

Crusoe is looking for candidates with a strong software engineering background, experience in distributed systems design, hands-on experience with LLMs or AI/ML infrastructure, and a solid understanding of SRE principles.

What programming languages are preferred for this role?

Proficiency in at least one modern programming language, such as Python, Go, Java, or C++, is required.

What are the key responsibilities of a Staff SRE at Crusoe?

Key responsibilities include designing and operating reliable managed AI services, building automation and reliability tooling, defining and measuring SLIs/SLOs, collaborating with other teams to optimize AI pipelines, and investigating and resolving reliability issues.

What benefits does Crusoe offer?

Crusoe offers a competitive salary, restricted stock units, comprehensive health insurance, HSA contributions, paid parental leave, life and disability insurance, Teladoc access, a 401(k) plan with a company match, generous paid time off, and additional perks like cell phone reimbursement and tuition reimbursement.

How does Crusoe contribute to sustainability?

Crusoe contributes to sustainability by using otherwise wasted energy sources to power cloud computing solutions, reducing flaring and emissions while supporting compute-intensive workloads such as AI and machine learning.

What is the work environment like at Crusoe?

Crusoe fosters a culture of innovation, collaboration, and sustainability. The work environment is dynamic, fast-paced, and mission-driven, with a focus on continuous learning and open communication.

Locations

San Francisco, California, United States

Salary

Estimated Salary Range (medium confidence)

198,000 - 308,000 USD per year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

Site Reliability Engineering (SRE): intermediate
Distributed Systems Design: intermediate
Large Language Models (LLMs): intermediate
AI/ML Infrastructure: intermediate
Kubernetes: intermediate
Container Orchestration: intermediate
Python: intermediate
Go: intermediate
Java: intermediate
C++: intermediate
SLI/SLO Definition: intermediate
Monitoring Systems: intermediate
Observability Systems: intermediate
Fault-Tolerant Systems: intermediate
Automated Testing: intermediate
Telemetry: intermediate
Logging: intermediate
Profiling: intermediate
Performance Tuning: intermediate
Cloud Computing: intermediate
Automation: intermediate
Problem-Solving: intermediate
Collaboration: intermediate
Communication: intermediate
Infrastructure as Code: intermediate

Required Qualifications

Strong software engineering background with experience building production-grade systems.
Demonstrated experience in distributed systems design and implementation.
Hands-on experience with large language models (LLMs) or AI/ML infrastructure.
SRE mindset and experience in defining and measuring SLIs/SLOs.
Experience building monitoring and observability systems.
Proven ability to drive performance and reliability improvements.
Experience designing fault-tolerant systems and automated testing strategies.
Proficiency in at least one modern programming language (Python, Go, Java, C++).
Familiarity with Kubernetes or container orchestration platforms.
Strong collaboration and communication skills.
Ability to thrive in a fast-paced, mission-driven environment.
Bonus: Experience scaling inference or training workloads for LLMs.

Responsibilities

Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads.
Build automation and reliability tooling to support distributed AI pipelines and inference services.
Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met.
Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters.
Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services.
Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling.
Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments.
Participate in on-call rotations to ensure system uptime and availability.
Develop and maintain comprehensive documentation for systems and processes.
Mentor junior engineers and share knowledge within the SRE team.
Proactively identify and address potential performance bottlenecks and scalability limitations.
Implement security best practices in all aspects of system design and operation.
Participate in incident response and post-mortem analysis to prevent future occurrences.
Work with vendors to evaluate and integrate new technologies.

Benefits

Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid parental leave
Paid life insurance
Short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter benefit: $300 per month

Tags & Categories

SRE, AI, LLM, Kubernetes, Cloud, San Francisco, Site Reliability Engineer, Managed AI, Large Language Models, Artificial Intelligence, Machine Learning, Cloud Computing, California, Container Orchestration, Distributed Systems, High Availability, Scalability, Performance Tuning, Monitoring, Observability, Automation, Infrastructure as Code, Python, Go, Java, C++, Crusoe Energy Systems, Green Computing, Sustainable Technology, AI Infrastructure, Green Tech, Engineering

