
Staff Site Reliability Engineer - Incident Management & Reliability Careers at Confluent - Remote, Ontario | Apply Now!

Confluent

Full-time | Posted: Jan 23, 2026

Job Description

Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.

About the Role:

As a Staff Site Reliability Engineer specializing in Incident Management & Reliability at Confluent, you will play a crucial role in ensuring the stability and performance of our cloud-based data streaming platform. Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in such a complex, multi-cloud environment, they occur at scale, requiring deep systems thinking to resolve. This role demands an expert-level engineer capable of driving proactive reliability improvements that prevent incidents before they occur.

This position blends hands-on technical work with strategic program ownership. You'll dedicate approximately 75% of your time to engineering tasks, including building automation, enhancing tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% will focus on teaching and coordination, such as coaching teams through post-mortems, training incident commanders, and refining our incident response practices. You'll be part of a global team providing follow-the-sun coverage, ensuring sustainable work hours for everyone. This role resides within Cloud Architecture and Reliability - Supportability, a horizontal team responsible for reliability standards and tooling across engineering. You are the person who makes incident management less necessary.

A Day in the Life:

Your day as a Staff Site Reliability Engineer at Confluent might involve:

  • Analyzing recent incidents to identify recurring patterns and root causes.
  • Developing automation scripts to proactively detect and mitigate potential issues.
  • Improving incident management tooling and workflows.
  • Participating in a post-mortem review, guiding the team to identify actionable improvements.
  • Conducting a training session for new incident commanders.
  • Collaborating with engineering teams to implement reliability enhancements.
  • Reviewing and editing a customer-facing incident document to ensure clarity and accuracy.
  • Defining and maintaining SLO/SLA frameworks.

Why Remote in Ontario, Canada?

Confluent embraces remote work, recognizing that talent exists everywhere. Working remotely from Ontario, Canada, gives you the flexibility to work where you are most productive while remaining an integral part of our global team. Ontario offers a thriving tech community, a high quality of life, and a location that bridges North America and Europe. Remote work also lets you balance your professional and personal life while contributing to a cutting-edge company.

Career Path:

This Staff Site Reliability Engineer role offers a clear path for career advancement within Confluent. You can progress into roles such as Principal SRE, Architect, or Engineering Manager, depending on your interests and skills. Confluent is committed to providing its employees with opportunities for growth and development, offering training programs, mentorship, and challenging projects to help you reach your full potential.

Salary & Benefits:

Confluent offers a competitive salary and benefits package. The estimated salary range for this position in Ontario, Canada, is $170,000 - $250,000 CAD annually. Actual compensation may vary based on experience, skills, and location. In addition to salary, Confluent provides a comprehensive benefits package that includes:

  • Comprehensive health insurance
  • Generous paid time off and holidays
  • Paid parental leave
  • Retirement plan with employer matching
  • Employee stock purchase program
  • Professional development opportunities
  • Wellness programs
  • Flexible work arrangements
  • Remote work options
  • Employee assistance program
  • Life insurance
  • Disability insurance
  • Commuter benefits
  • Company-sponsored events

Confluent Culture:

At Confluent, we believe that belonging isn’t a perk, it’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. We foster a culture of collaboration, innovation, and respect, where everyone feels empowered to contribute their unique skills and experiences. Our core values include:

  • Customer Obsession
  • Bias for Action
  • Candor and Transparency
  • Impactful Innovation
  • One Confluent

How to Apply:

If you are a passionate and experienced Site Reliability Engineer with a strong focus on incident management and reliability, we encourage you to apply for this exciting opportunity. To apply, please submit your resume and cover letter through our online application portal. We look forward to hearing from you!

FAQ:

  1. What is Confluent?

    Confluent is a data streaming platform that enables companies to react faster, build smarter, and deliver experiences as dynamic as the world around them.

  2. What does a Staff Site Reliability Engineer do at Confluent?

    A Staff Site Reliability Engineer at Confluent focuses on ensuring the stability, performance, and reliability of our cloud-based data streaming platform. They analyze incidents, develop automation, improve tooling, and coach teams on incident response best practices.

  3. What skills are required for this role?

    Key skills include experience in SRE, incident management, cloud computing (AWS, GCP, Azure), distributed systems, observability, Kubernetes, CI/CD pipelines, and strong communication skills.

  4. Is this a remote position?

    Yes, this position is remote and based in Ontario, Canada.

  5. What is the career path for this role?

    You can progress into roles such as Principal SRE, Architect, or Engineering Manager.

  6. What is the salary range for this position?

    The estimated salary range is $170,000 - $250,000 CAD annually, depending on experience and skills.

  7. What benefits does Confluent offer?

    Confluent offers a comprehensive benefits package that includes health insurance, paid time off, parental leave, retirement plan, and more.

  8. What is the culture like at Confluent?

    Confluent fosters a culture of collaboration, innovation, and respect, where everyone feels empowered to contribute their unique skills and experiences.

  9. How do I apply for this position?

    Submit your resume and cover letter through our online application portal.

  10. What tools are used for Incident Management?

    We use tools such as Rootly, PagerDuty, Jira, Confluence, and Slack for incident management.

Locations

  • Ontario, Canada (Remote)

Salary

Estimated Salary Range (medium confidence)

187,000 - 275,000 USD / year

Source: AI-estimated

* This is an estimated range based on market data and may vary based on experience and qualifications.

Skills Required

  • Site Reliability Engineering (SRE): intermediate
  • Incident Management: intermediate
  • Reliability Engineering: intermediate
  • Cloud Computing (AWS, GCP, Azure): intermediate
  • Distributed Systems: intermediate
  • Observability (Metrics, Logging, Tracing): intermediate
  • Kubernetes: intermediate
  • Container Orchestration: intermediate
  • CI/CD Pipelines: intermediate
  • Release Processes: intermediate
  • Written Communication: intermediate
  • Process Improvement: intermediate
  • Cultural Change Management: intermediate
  • Kafka: intermediate
  • Event Streaming: intermediate
  • Rootly: intermediate
  • PagerDuty: intermediate
  • Jira: intermediate
  • Confluence: intermediate
  • Slack: intermediate

Required Qualifications

  • 10+ years of experience in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure
  • Experience navigating reliability/incident programs at organizations with 500+ engineers
  • Expertise with incident management tooling (Rootly, PagerDuty, or similar)
  • Strong understanding of distributed systems and failure modes at scale
  • Experience with observability: metrics, logging, tracing
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Strong written communication (design docs, runbooks, post-mortems)
  • Experience driving org-wide process and cultural changes
  • Kafka/event streaming expertise preferred

Responsibilities

  • Analyze systemic failure patterns and design reliability improvements to prevent incident recurrence.
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack.
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments.
  • Own standards, practices, and continuous improvement of incident response across engineering.
  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity.
  • Develop and deliver training programs; coach teams through post-mortems.
  • Partner with engineering leaders to elevate reliability practices org-wide.
  • Build automation to improve incident response and prevention.
  • Improve tooling for incident analysis and resolution.
  • Train incident commanders.
  • Evolve incident response practices.
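
For readers less familiar with the error budgets mentioned in the responsibilities above, the arithmetic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of Confluent's tooling: the `ErrorBudget` class, its field names, and the sample numbers are all hypothetical, and it assumes a simple request-based availability SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float      # e.g. 0.999 for a 99.9% success objective
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means untouched budget; 0.0 or below means it is spent.
        allowed = self.allowed_failures
        if allowed == 0:
            # A 100% target (or an empty window) leaves no room for failure.
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed

    @property
    def exhausted(self) -> bool:
        return self.remaining_fraction <= 0.0


if __name__ == "__main__":
    # Sample numbers are made up for illustration.
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000,
                         failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget remaining: {budget.remaining_fraction:.1%}")
    print(f"exhausted: {budget.exhausted}")
```

A team might treat a nearly spent budget as a signal to prioritize reliability work over new releases; the threshold for that decision is a policy choice and is not specified in this posting.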



Tags & Categories

SRE, Incident Management, Cloud, Remote, Canada, Kafka, Kubernetes, Site Reliability Engineer, Reliability Engineering, Cloud Computing, AWS, GCP, Azure, Distributed Systems, Observability, Container Orchestration, CI/CD Pipelines, Release Processes, Automation, Tooling, Post-Mortems, Incident Commander, SLO, SLA, Error Budgets, Event Streaming, Remote Work, Ontario, Confluent Careers, Data Streaming, Engineering, Go-To-Market

