Senior Site Reliability Engineer Careers at Crusoe - San Francisco, California | Apply Now!

Crusoe

Full-time | Posted: Dec 5, 2025

Job Description

Senior Site Reliability Engineer at Crusoe: Powering the AI Revolution Sustainably

Role Overview

Crusoe is at the forefront of accelerating the abundance of energy and intelligence. Our mission is to build the engine that powers a world where people can create ambitiously with AI, without sacrificing scale, speed, or sustainability. We are building the most reliable, energy-efficient, AI-optimized cloud platform, and operational excellence is at the heart of that mission.

As a Senior Site Reliability Engineer focused on Operational Excellence, you will play a critical role in ensuring the stability, resilience, and performance of Crusoe’s GPU cloud. This role is perfect for engineers who thrive in fast-paced environments, enjoy solving operational problems, and want to grow their technical career while supporting incident response, reliability, and continuous improvement across a large-scale distributed platform.

You'll partner closely with senior SREs, infrastructure engineers, and platform teams to improve reliability, reduce operational toil, and strengthen Crusoe’s incident management practices. Your contributions will directly impact the success of our mission and the growth of our sustainable cloud infrastructure.

A Day in the Life

Here’s a glimpse into what a typical day might look like as a Senior Site Reliability Engineer at Crusoe:

  • Morning: Start the day by reviewing monitoring dashboards and alerts to identify any potential issues or performance bottlenecks. Participate in a daily stand-up meeting with the SRE team to discuss ongoing projects, priorities, and any roadblocks.
  • Mid-day: Collaborate with infrastructure engineers to troubleshoot a service disruption. Analyze logs, metrics, and system behavior to identify the root cause and implement a fix. Document the incident and participate in a post-incident review to identify areas for improvement.
  • Afternoon: Work on automating a manual operational task using Python and Ansible. This will help reduce operational toil and improve the efficiency of the SRE team. Partner with the compute team to improve the resilience of a critical service by implementing a new disaster recovery strategy.
  • Evening: Participate in an on-call rotation to support the Crusoe cloud platform. Monitor system health and respond to any incidents that occur. Contribute to knowledge sharing by documenting best practices and creating training materials for junior SREs.

Why San Francisco, California?

San Francisco is a global hub for technology innovation and talent, making it the ideal location for Crusoe's headquarters. Being in San Francisco provides access to a vibrant ecosystem of engineers, researchers, and entrepreneurs. The city's proximity to leading universities and research institutions ensures a constant stream of talent and fresh ideas. San Francisco also offers a high quality of life, with a diverse culture, world-class restaurants, and stunning natural beauty. Working in San Francisco provides unparalleled opportunities for professional growth and personal enrichment.

Career Path

At Crusoe, we are committed to fostering the growth and development of our employees. As a Senior Site Reliability Engineer, you'll have opportunities to advance your career in a variety of directions. You could specialize in a particular area of SRE, such as incident management, automation, or performance engineering. You could also move into a leadership role, managing a team of SREs and driving the overall reliability strategy for Crusoe. We provide ongoing training, mentorship, and professional development opportunities to help you achieve your career goals.

Salary & Benefits

Crusoe offers a competitive salary and benefits package to attract and retain top talent. The estimated salary range for a Senior Site Reliability Engineer in San Francisco is $170,000 to $250,000 per year, depending on experience and qualifications. In addition to salary, we offer a comprehensive benefits package that includes:

  • Equity compensation
  • Comprehensive health, dental, and vision insurance
  • Generous paid time off and holidays
  • 401(k) retirement plan with company match
  • Professional development opportunities and tuition reimbursement
  • Wellness programs and resources
  • Employee assistance program (EAP)
  • Flexible work arrangements
  • Company-sponsored events and team-building activities
  • Stocked kitchen with snacks and beverages
  • Commuter benefits

Crusoe Culture

At Crusoe, we are driven by our mission to accelerate the abundance of energy and intelligence. We foster a culture of innovation, collaboration, and sustainability. We value creativity, problem-solving, and a growth mindset. We are committed to creating a diverse and inclusive workplace where everyone feels valued and respected. We believe that our people are our greatest asset, and we invest in their growth and development. We are passionate about using technology to solve some of the world's most pressing challenges.

How to Apply

If you are a talented and motivated engineer who is passionate about reliability, sustainability, and innovation, we encourage you to apply for the Senior Site Reliability Engineer position at Crusoe. To apply, please submit your resume and cover letter through our online application portal. Be sure to highlight your relevant experience and qualifications, and explain why you are interested in working at Crusoe.

FAQ

  1. What is Crusoe's mission?

    Crusoe's mission is to accelerate the abundance of energy and intelligence.

  2. What is the role of a Senior Site Reliability Engineer at Crusoe?

    The Senior Site Reliability Engineer ensures the stability, resilience, and performance of Crusoe’s GPU cloud.

  3. What skills are required for this role?

    Key skills include cloud operations, SRE, GPU workload experience, Linux systems administration, networking, and automation.

  4. What experience is needed for this role?

    5+ years of experience in cloud operations, SRE, or related roles is required.

  5. What tools and technologies does Crusoe use?

    Crusoe uses Kubernetes, AWS/GCP, Prometheus, Grafana, Terraform, Ansible, and scripting languages like Go and Python.

  6. What is the career path for this role?

    Opportunities exist to specialize in SRE areas or move into leadership roles within the SRE team.

  7. What benefits does Crusoe offer?

    Crusoe offers competitive salary, health insurance, paid time off, 401(k), professional development, and more.

  8. What is the work environment like at Crusoe?

    Crusoe fosters a culture of innovation, collaboration, sustainability, and inclusivity.

  9. How does Crusoe support professional development?

    Crusoe provides ongoing training, mentorship, and professional development opportunities.

  10. How can I apply for this position?

    Submit your resume and cover letter through our online application portal.

Locations

  • San Francisco, California, United States

Salary

Estimated Salary Range (medium confidence)

187,000 - 275,000 USD per year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Cloud Operations (intermediate)
  • Site Reliability Engineering (SRE) (intermediate)
  • GPU Workloads (intermediate)
  • High-Performance Computing (HPC) (intermediate)
  • Latency/Throughput Optimization (intermediate)
  • Unix/Linux Systems Administration (intermediate)
  • Networking (TCP/IP, DNS, Routing) (intermediate)
  • Cloud Platforms (Kubernetes, AWS, GCP) (intermediate)
  • Virtualization (intermediate)
  • Distributed Systems (intermediate)
  • Incident Management (intermediate)
  • Monitoring and Alerting (Prometheus, Grafana, Alertmanager) (intermediate)
  • Observability (OpenTelemetry) (intermediate)
  • Infrastructure-as-Code (Terraform) (intermediate)
  • Configuration Management (Ansible) (intermediate)
  • Scripting (Go, Python) (intermediate)
  • Automation (intermediate)
  • Root Cause Analysis (RCA) (intermediate)
  • SLI/SLO Management (intermediate)
  • Disaster Recovery (intermediate)
  • Communication (intermediate)
  • Problem-Solving (intermediate)
  • Collaboration (intermediate)
  • Operational Excellence (intermediate)
  • Growth Mindset (intermediate)

Required Qualifications

  • 5+ years of experience in cloud operations, SRE, or related roles.
  • Background working with GPU workloads, high-performance computing, or latency/throughput-sensitive systems.
  • Strong knowledge of Unix/Linux systems (kernel/user space) and networking.
  • Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems).
  • Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.).
  • Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn.
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible.
  • Basic scripting and automation experience (Go, Python, C, C++, or similar).
  • Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders.
  • Ability to stay calm, focused, and effective in fast-moving or high-pressure situations.
  • A growth mindset with enthusiasm for operational excellence, reliability engineering, and continuous improvement.

Responsibilities

  • Collaborate with cross-functional teams to define and refine availability metrics for Crusoe’s cloud infrastructure, including establishing, tracking, and improving SLIs and SLOs.
  • Assist in incident response by identifying, diagnosing, and resolving service disruptions.
  • Support post-incident processes through RCA documentation and participation in post-incident reviews.
  • Build, operate, and monitor infrastructure health using Crusoe’s observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry).
  • Identify and communicate reliability risks, performance bottlenecks, and early indicators of potential incidents that could impact service availability.
  • Develop automation and tooling to reduce operational toil, minimize manual intervention, and enhance service recovery and self-healing capabilities.
  • Partner with compute, network, storage, and platform teams to improve service resilience and strengthen disaster recovery readiness.
  • Contribute to knowledge sharing, process improvements, and the development of operational best practices across the organization.
  • Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities.
  • Participate in on-call rotations to support the Crusoe cloud platform.
  • Develop and maintain comprehensive documentation for systems and processes.
  • Proactively identify opportunities to improve system performance, security, and scalability.

Benefits

  • Competitive salary and equity compensation.
  • Comprehensive health, dental, and vision insurance.
  • Generous paid time off and holidays.
  • 401(k) retirement plan with company match.
  • Professional development opportunities and tuition reimbursement.
  • Wellness programs and resources.
  • Employee assistance program (EAP).
  • Flexible work arrangements.
  • Company-sponsored events and team-building activities.
  • Stocked kitchen with snacks and beverages.
  • Commuter benefits.
  • Opportunity to work on cutting-edge technology in a rapidly growing company.
  • A supportive and collaborative work environment.
  • Opportunity to make a significant impact on the future of sustainable cloud computing.
  • Access to Crusoe Cloud platform resources for personal projects and development.

Tags & Categories

SRE, Cloud, GPU, Kubernetes, Automation, San Francisco, Full-time, Site Reliability Engineer, Cloud Operations, GPU Computing, High-Performance Computing, AWS, GCP, Prometheus, Grafana, Terraform, Ansible, Linux, Networking, Incident Management, California, Crusoe Energy, Sustainable Cloud, AI Infrastructure, Operational Excellence, Cloud Infrastructure, Reliability Engineering, DevOps, Green Tech, Engineering
