Senior Site Reliability Engineer (Managed AI) Careers at Crusoe - San Francisco, California | Apply Now!

Crusoe

Full-time | Posted: Feb 3, 2026

Job Description

Senior Site Reliability Engineer (Managed AI) at Crusoe

Role Overview

As a Senior Site Reliability Engineer specializing in Managed AI at Crusoe, you will build and operate the infrastructure that powers the next generation of AI applications. The role is central to the reliability, scalability, and performance of Crusoe's AI-optimized cloud platform, particularly for large language model (LLM) workloads. You will collaborate with cross-functional teams to optimize AI training and inference clusters, automate observability, and resolve critical reliability issues. Your expertise in distributed systems, combined with hands-on experience with LLMs, will be instrumental in delivering highly available and cost-efficient AI infrastructure to our customers.

Day in the Life

A typical day for a Senior Site Reliability Engineer at Crusoe might include:

  • Designing and implementing automation to scale LLM inference services.
  • Collaborating with data scientists and AI engineers to optimize AI pipeline performance.
  • Monitoring system performance and identifying potential bottlenecks using telemetry and logging tools.
  • Participating in incident response, troubleshooting, and resolving system outages.
  • Defining and refining SLIs/SLOs to ensure alignment with business objectives.
  • Contributing to the architecture of new distributed systems for AI workloads.
  • Writing code to automate infrastructure provisioning and management tasks.
  • Participating in code reviews and knowledge sharing sessions with the team.

Why San Francisco?

San Francisco is a hub for technological innovation, attracting top talent and fostering a dynamic and collaborative environment. Being located in San Francisco provides Crusoe with access to a diverse pool of skilled engineers, data scientists, and AI specialists. The city's vibrant tech ecosystem promotes networking opportunities, partnerships, and continuous learning, making it an ideal location for driving innovation in AI and cloud computing.

Career Path

At Crusoe, we are committed to providing opportunities for career growth and development. As a Senior Site Reliability Engineer, you can advance along technical or management tracks. Opportunities include:

  • Principal Engineer: Focus on solving complex technical challenges and leading architectural initiatives.
  • Staff Engineer: Provide technical guidance and mentorship to other engineers, driving best practices and standards.
  • Engineering Manager: Lead and manage a team of SREs, overseeing project execution and team performance.
  • Architect: Design and implement large-scale systems and infrastructure solutions, ensuring scalability and reliability.

Salary and Benefits

Crusoe offers a competitive compensation package that includes:

  • Competitive Salary: Based on experience and market rates, with regular performance-based reviews.
  • Restricted Stock Units (RSUs): Ownership in a fast-growing, well-funded technology company.
  • Health Insurance: Comprehensive health, vision, and dental coverage for you and your dependents.
  • HSA Contributions: Employer contributions to Health Savings Accounts.
  • Paid Parental Leave: Generous paid time off for new parents.
  • Life and Disability Insurance: Paid life insurance, short-term and long-term disability coverage.
  • Teladoc: Access to virtual healthcare services.
  • 401(k): 100% match up to 4% of salary.
  • Paid Time Off: Generous paid time off and holiday schedule.
  • Cell Phone Reimbursement: Monthly reimbursement for cell phone expenses.
  • Tuition Reimbursement: Support for ongoing education and professional development.
  • Calm App Subscription: Subscription to the Calm app for mindfulness and stress reduction.
  • MetLife Legal: Access to legal services.
  • Commuter Benefit: Company-paid commuter benefit ($300 per month).

Crusoe Culture

Crusoe Energy Systems Inc. was founded in 2018 with a vision to align the future of energy with the future of computing. As the world’s climate continues to change, we are motivated to build systems for the energy transition to unlock clean, stranded, and wasted energy resources. Crusoe is committed to building a company culture that is inclusive, collaborative, and innovative. We value diversity and believe that diverse teams are essential to our success. We strive to create a workplace where everyone feels welcome, respected, and empowered to contribute their best work.

We have a team of innovators who are building new systems that create a more sustainable future. We are looking for talented, passionate individuals to join our team and help us achieve our mission.

How to Apply

Interested candidates are encouraged to apply online through the Crusoe Careers page. Please submit your resume, cover letter, and any relevant portfolio materials. Be sure to highlight your experience with distributed systems, large language models, and SRE practices. We look forward to hearing from you!

Frequently Asked Questions (FAQ)

  1. What is Crusoe's mission?

    Crusoe's mission is to accelerate the abundance of energy and intelligence.

  2. What type of projects will I be working on?

    You'll be working on designing, building, and operating the infrastructure that powers AI applications, with a focus on large language models.

  3. What is the company culture like?

    Crusoe fosters an inclusive, collaborative, and innovative culture where diversity is valued and employees are empowered to contribute their best work.

  4. What are the career growth opportunities?

    Crusoe provides opportunities for career growth along technical or management tracks, including roles such as Principal Engineer, Engineering Manager, and Architect.

  5. What benefits does Crusoe offer?

    Crusoe offers a comprehensive benefits package, including competitive salary, RSUs, health insurance, HSA contributions, paid parental leave, and more.

  6. Is there room to grow?

    Yes. Crusoe is a fast-growing company that provides ample opportunities for employees to advance their skills and careers.

  7. What is the management like?

    Crusoe has a strong leadership team with extensive experience in energy, technology, and finance. The management encourages open communication, collaboration, and innovation.

  8. What is the work-life balance like?

    Crusoe values work-life balance and provides employees with generous paid time off and flexible work arrangements.

  9. Is Crusoe a good fit for me?

    If you are passionate about AI, cloud computing, and sustainability, and thrive in a fast-paced, mission-driven environment, Crusoe may be a good fit for you.

  10. What does the interview process look like?

    Typically there is an initial phone screen with a recruiter, followed by one or more technical interviews with members of the SRE team, and a final interview with the hiring manager. The process may also include a take-home assignment or a live coding exercise.

Locations

  • San Francisco, California, United States

Salary

Estimated Salary Range (medium confidence)

198,000 - 308,000 USD per year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Site Reliability Engineering (SRE) (intermediate)
  • Distributed Systems (intermediate)
  • Large Language Models (LLMs) (intermediate)
  • AI/ML Infrastructure (intermediate)
  • Kubernetes (intermediate)
  • Container Orchestration (intermediate)
  • Python (intermediate)
  • Go (intermediate)
  • Java (intermediate)
  • C++ (intermediate)
  • Cloud Computing (intermediate)
  • Automation (intermediate)
  • Monitoring (intermediate)
  • Observability (intermediate)
  • Telemetry (intermediate)
  • Performance Tuning (intermediate)
  • Fault Tolerance (intermediate)
  • System Design (intermediate)
  • Software Engineering (intermediate)
  • Collaboration (intermediate)
  • Communication (intermediate)
  • Problem Solving (intermediate)
  • Troubleshooting (intermediate)
  • SLIs/SLOs Management (intermediate)
  • AI Pipelines (intermediate)
  • Inference Services (intermediate)
  • Agile Methodologies (intermediate)
  • DevOps (intermediate)

Required Qualifications

  • Strong software engineering background with experience building production-grade systems.
  • Demonstrated experience in distributed systems design and implementation.
  • Hands-on experience with large language models (LLMs) or AI/ML infrastructure.
  • SRE mindset and experience, including defining and measuring SLIs/SLOs.
  • Experience building monitoring and observability systems.
  • Experience driving performance and reliability improvements.
  • Experience designing fault-tolerant systems and automated testing strategies.
  • Proficiency in at least one modern programming language (Python, Go, Java, C++).
  • Familiarity with Kubernetes or container orchestration platforms.
  • Strong collaboration and communication skills.
  • Ability to thrive in a fast-paced, mission-driven environment.
  • Bonus: Experience scaling inference or training workloads for LLMs.

Responsibilities

  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads.
  • Build automation and reliability tooling to support distributed AI pipelines and inference services.
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met.
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters.
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services.
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling.
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments.
  • Participate in on-call rotations to ensure system availability and responsiveness.
  • Develop and maintain comprehensive documentation for systems and processes.
  • Mentor junior engineers and share knowledge within the team.
  • Proactively identify and address potential performance bottlenecks and reliability risks.
  • Implement and maintain security best practices for AI infrastructure.

Benefits

  • Industry-competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance
  • Short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month
