Site Reliability Engineer - Monitoring

xAI

Full-time
Posted: Dec 29, 2025

Job Description

About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

As a Site Reliability Engineer in Monitoring, you will focus on developing and managing monitoring solutions, with heavy emphasis on Grafana for creating dashboards that provide visibility into datacenter health. You will leverage programming skills to automate monitoring, analyze data, and scale business operations through insightful visualizations. This role requires collaboration with datacenter teams to deliver actionable insights and minimize downtime in xAI's infrastructure.

Responsibilities

  • Design, build, and maintain Grafana dashboards tailored for datacenter technician organizations, providing real-time views into system health, performance metrics, and monitoring alerts.
  • Develop automation scripts and tools using languages such as Java, Golang, Python, C/C++/C#, Bash, or Linux shell scripting to integrate monitoring systems and process data in JSON formats.
  • Collaborate with Datacenter Operations Technicians to identify monitoring needs, troubleshoot issues, and ensure dashboards support efficient incident response and preventive maintenance.
  • Evaluate and optimize existing dashboards for scalability, drawing from past experiences in creating monitoring solutions that have driven business growth.
  • Manage dashboard lifecycle, including version control, updates, and performance tuning to handle large-scale datacenter environments.
  • Participate in on-call rotations, incident analysis, and root cause investigations using monitoring data to improve system reliability.
  • Document monitoring strategies, dashboard designs, and best practices to foster knowledge sharing within the team.

Required Qualifications

  • Bachelor's degree in Computer Science, Software Engineering, or a related field (or equivalent experience).
  • 5+ years of experience in site reliability engineering or monitoring roles, preferably in datacenter or cloud environments.
  • Proficiency in at least two of the following programming languages: Java, Golang, Python, C/C++/C#, with strong skills in Linux and Bash scripting.
  • Hands-on experience working with JSON for data parsing, integration, and API interactions.
  • Expert-level knowledge of Grafana, including creating complex dashboards, queries, and integrations with data sources like Prometheus or InfluxDB.
  • Proven track record of developing dashboards that provide health and monitoring views for operational teams, with examples of how they scaled business operations.
  • Experience managing monitoring tools and dashboards, including optimization, alerting, and integration into CI/CD pipelines.
  • Strong problem-solving skills with a focus on data-driven decision-making and collaboration in fast-paced environments.

Preferred Qualifications

  • Experience in AI/ML infrastructure or high-performance computing monitoring.
  • Familiarity with other monitoring tools (e.g., Grafana) and observability practices.
  • Prior work in a startup or tech company like xAI, with contributions to scalable monitoring systems.

xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.

Locations

  • Memphis, TN

Salary

Salary details available upon request

Estimated Salary Range (medium confidence)

300,000 - 650,000 USD / yearly

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Grafana (intermediate)
  • Java (intermediate)
  • Golang (intermediate)
  • Python (intermediate)
  • C/C++/C# (intermediate)
  • Bash (intermediate)
  • Linux shell scripting (intermediate)
  • JSON (intermediate)
  • Prometheus (intermediate)
  • InfluxDB (intermediate)
  • Linux (intermediate)
  • Problem-solving (intermediate)
  • Data-driven decision-making (intermediate)
  • Collaboration (intermediate)

Tags & Categories

Engineering