Senior Site Reliability Engineer (Managed AI) Careers at Crusoe - San Francisco, California | Apply Now!

Crusoe

Full-time | Posted: Feb 3, 2026

Job Description

Senior Site Reliability Engineer (Managed AI) at Crusoe

Role Overview

As a Senior Site Reliability Engineer (SRE) specializing in Managed AI at Crusoe, you will be at the forefront of building and maintaining the infrastructure that powers the AI revolution. You will work on ensuring the reliability, scalability, and performance of Crusoe's AI-optimized cloud platform, with a specific focus on large language models (LLMs). This role is crucial for delivering highly available, performant, and cost-efficient AI infrastructure to our customers, enabling them to tackle compute-intensive and latency-sensitive workloads.

Your primary responsibility will be to design, build, and operate reliable managed AI services, with a strong emphasis on serving and scaling LLM workloads. You will develop automation and reliability tooling to support distributed AI pipelines and inference services. Defining and measuring Service Level Indicators (SLIs) and Service Level Objectives (SLOs) across AI workloads will be essential to ensure performance and reliability targets are met. Collaboration with AI, platform, and infrastructure teams will be vital for optimizing large-scale training and inference clusters. Furthermore, you will automate observability, create telemetry, and devise performance tuning strategies for latency-sensitive AI services.

Investigating and resolving reliability issues in distributed AI systems using telemetry, logs, and profiling tools will be a regular part of your work. You will also contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments, playing a key role in shaping the future of AI infrastructure.

A Day in the Life

Here’s what a typical day might look like for a Senior Site Reliability Engineer at Crusoe:

  • Morning: Start the day by reviewing monitoring dashboards and alerts to identify any potential issues or anomalies in the AI infrastructure. Participate in a stand-up meeting with the SRE team to discuss ongoing projects, challenges, and priorities.
  • Mid-day: Work on automating the deployment and scaling of LLM inference services using Kubernetes. Collaborate with the AI team to optimize model serving configurations for performance and cost-efficiency. Investigate a reported performance issue with a specific AI workload, using telemetry data and logs to identify the root cause.
  • Afternoon: Participate in a design review for a new feature in the AI platform. Contribute your expertise in reliability and scalability to ensure the feature is designed with SRE best practices in mind. Work on defining SLIs and SLOs for a new AI service, collaborating with product managers and stakeholders to establish meaningful metrics.
  • Late Afternoon: Develop and test a new monitoring tool to track the utilization of GPU resources across the AI infrastructure. Document the tool and provide training to other team members on its usage. Participate in an on-call rotation, responding to any incidents or alerts that may arise.

Why San Francisco?

San Francisco is a global hub for technology and innovation, making it an ideal location for a Senior Site Reliability Engineer working on cutting-edge AI infrastructure. The city boasts a vibrant ecosystem of startups, established tech companies, and research institutions, creating a wealth of opportunities for professional growth and networking. Furthermore, San Francisco is renowned for its diverse culture, world-class dining, and access to outdoor activities, making it a desirable place to live and work.

Career Path

At Crusoe, we are committed to the growth and development of our employees. A Senior Site Reliability Engineer can advance their career in several directions:

  • Principal SRE: Focus on providing technical leadership and strategic direction for the SRE team.
  • SRE Manager: Lead and manage a team of SREs, responsible for the overall reliability and performance of the AI infrastructure.
  • Architect: Design next-generation AI infrastructure solutions, drawing on your deep technical expertise.
  • Specialized SRE Roles: Focus on specific areas of SRE, such as performance engineering, security, or automation.

Salary and Benefits

The estimated salary range for a Senior Site Reliability Engineer (Managed AI) in San Francisco is $180,000 to $280,000 per year. Note that salary ranges can vary based on experience, skills, and other factors.

Crusoe offers a comprehensive benefits package, including:

  • Industry competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company paid commuter benefit; $300 per month

Crusoe Culture

Crusoe is committed to accelerating the abundance of energy and intelligence by crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability. Crusoe is a fast-paced, mission-driven environment where innovation and collaboration are highly valued. We are looking for individuals who are passionate about making a tangible impact and contributing to a team that is setting the pace for responsible, transformative cloud infrastructure. We value integrity, intellectual honesty, and a commitment to excellence.

How to Apply

Interested candidates are encouraged to apply through the Crusoe careers page. Please submit your resume and a cover letter highlighting your relevant experience and qualifications.

Frequently Asked Questions (FAQ)

  1. What is Crusoe's mission?
    Crusoe's mission is to accelerate the abundance of energy and intelligence.
  2. What type of work will I be doing as a Senior SRE?
    You will be designing, building, and operating reliable managed AI services, focusing on LLM workloads.
  3. What skills are important for this role?
    Strong software engineering, distributed systems, and AI/ML infrastructure experience are crucial.
  4. What is the work environment like at Crusoe?
    It's a fast-paced, mission-driven environment that values innovation and collaboration.
  5. What are the benefits of working at Crusoe?
    Competitive pay, Restricted Stock Units, comprehensive health benefits, and professional development opportunities are offered.
  6. Is there room for career growth at Crusoe?
    Yes, there are opportunities to advance to Principal SRE, SRE Manager, Architect, or specialized SRE roles.
  7. What is the on-call rotation like?
    You will participate in an on-call rotation to ensure system uptime and availability.
  8. What is the company's approach to sustainability?
    Sustainability is a core value, with a focus on responsible and transformative cloud infrastructure.
  9. What programming languages are preferred for this role?
    Proficiency in Python, Go, Java, or C++ is highly valued.
  10. Is experience with Kubernetes required?
    Familiarity with Kubernetes or container orchestration platforms is essential.

Locations

  • San Francisco, California, United States

Salary

Estimated Salary Range (medium confidence)

198,000 - 308,000 USD / year

Source: AI estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Site Reliability Engineering (SRE): intermediate
  • Distributed Systems: intermediate
  • Large Language Models (LLMs): intermediate
  • AI/ML Infrastructure: intermediate
  • Kubernetes: intermediate
  • Container Orchestration: intermediate
  • Python: intermediate
  • Go: intermediate
  • Java: intermediate
  • C++: intermediate
  • Cloud Platforms (e.g., AWS, Azure, GCP): intermediate
  • Monitoring and Observability: intermediate
  • Telemetry: intermediate
  • Performance Tuning: intermediate
  • Automation: intermediate
  • Fault-Tolerant Systems: intermediate
  • Automated Testing: intermediate
  • Collaboration: intermediate
  • Communication: intermediate
  • Problem-Solving: intermediate
  • Incident Response: intermediate
  • SLI/SLO Management: intermediate
  • Capacity Planning: intermediate
  • Configuration Management: intermediate
  • CI/CD Pipelines: intermediate
  • Infrastructure as Code (IaC): intermediate
  • Log Analysis: intermediate
  • Profiling: intermediate
  • Security Best Practices: intermediate
  • Networking Fundamentals: intermediate
  • Operating Systems (Linux): intermediate
  • Scripting (Bash, etc.): intermediate
  • Version Control (Git): intermediate
  • Agile Methodologies: intermediate

Required Qualifications

  • Strong software engineering background with experience building production-grade systems.
  • Demonstrated experience in distributed systems design and implementation.
  • Hands-on experience with large language models (LLMs) or AI/ML infrastructure.
  • SRE mindset and experience including defining and measuring SLIs/SLOs.
  • Experience building monitoring and observability systems.
  • Proven ability to drive performance and reliability improvements.
  • Experience in designing fault-tolerant systems and automated testing strategies.
  • Proficiency in at least one modern programming language (Python, Go, Java, C++).
  • Familiarity with Kubernetes or container orchestration platforms.
  • Strong collaboration and communication skills.
  • Ability to thrive in a fast-paced, mission-driven environment.
  • Experience scaling inference or training workloads for LLMs (Bonus).

Responsibilities

  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads.
  • Build automation and reliability tooling to support distributed AI pipelines and inference services.
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met.
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters.
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services.
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling.
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments.
  • Participate in on-call rotation and incident response to ensure system uptime and availability.
  • Develop and maintain documentation for systems and processes.
  • Perform root cause analysis of incidents and implement preventative measures.
  • Proactively identify and address potential performance bottlenecks and scalability issues.
  • Implement security best practices and ensure compliance with security policies.
  • Participate in code reviews and contribute to improving code quality.
  • Mentor junior engineers and share knowledge within the team.
  • Stay up-to-date with the latest trends and technologies in SRE and AI infrastructure.

Benefits

  • Industry competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance
  • Short-term and long-term disability insurance
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company paid commuter benefit; $300 per month

Tags & Categories

SRE, AI, LLM, Kubernetes, Cloud, San Francisco, Full-Time, Senior Site Reliability Engineer, Managed AI, Large Language Models, Artificial Intelligence, Machine Learning, Cloud Infrastructure, California, Distributed Systems, Automation, Monitoring, Telemetry, Performance Tuning, Python, Go, Java, C++, Career, Job, Hiring, Crusoe Energy, AI Infrastructure, SLI, SLO, Cloud Computing, Green Tech, Engineering

