Senior Site Reliability Engineer, Production Engineering

Cisco

Full-time | Posted: Dec 17, 2025

Job Description

Job ID: 2002521

Meet the Team

Cisco ThousandEyes is a Digital Experience Assurance platform that empowers organizations to deliver flawless digital experiences across every network - even the ones they don’t own. Powered by AI and an unmatched set of cloud, internet, and enterprise network telemetry data, ThousandEyes enables IT teams to proactively detect, diagnose, and remediate issues before they impact end-user experiences.

ThousandEyes is deeply integrated across the entire Cisco technology portfolio and beyond, helping customers deploy at scale while also delivering AI-powered assurance insights within Cisco’s leading Networking, Security, Collaboration, and Observability portfolios.

Your Impact

We are seeking a skilled Senior Site Reliability Engineer (SRE) in Production Engineering with a strong background in SaaS and operations. You will design and manage large-scale, highly available distributed systems in the cloud, collaborating directly with application development teams to enhance the reliability, performance, and security of our platform.

Technical Leadership & Collaboration: Forge strong partnerships with cross-functional stakeholders to identify requirements and deliver solutions that address project and departmental objectives.

Solution Design & Deployment: Architect and implement small to mid-size or moderately complex solutions that elevate reliability, availability, latency, and performance across diverse environments and customer segments.

Automation & Service Reliability: Combine expertise in design, automation, deployment, and coding to enhance system reliability for new and existing platforms, tailoring approaches to regional, national, or customer-specific needs.

High Availability & Disaster Recovery: Develop and validate automated high-availability and disaster recovery mechanisms, ensuring systems are robust and scalable and support rapid delivery. Take part in regular disaster recovery drills.

Capacity Planning & Reporting: Analyze resource usage and produce actionable reports to forecast and address capacity constraints, supporting proactive decision-making and operational excellence.

Monitoring & Tooling: Design, build, and deploy tools that deliver comprehensive visibility into infrastructure performance and reliability. Automate key platform functions for efficiency and resilience.

Incident Response & Continuous Improvement: Monitor production environments, collaborate with Development and Operations to diagnose issues, and develop monitoring tools to preemptively identify and resolve problems. Serve as on-call Site Reliability Engineer (SRE), lead post-mortems, and deliver clear root cause analyses.

Security & Compliance: Embed strong security controls in architectural design, collaborate with security teams to enhance safeguards, and contribute to incident response efforts as needed. Work closely with teams specializing in security to ensure that platform components and infrastructure are secured to the highest possible level.

Minimum Qualifications

  • Expert-level knowledge of Kubernetes and its ecosystem.

  • Proficiency in software development with languages such as Python or Go.

  • In-depth knowledge of cloud providers, preferably AWS.

  • Solid conceptual and practical knowledge in Web technologies, Networking, and Linux.

  • Knowledge of Site Reliability principles: Incident Response, Change Management, Distributed Systems, Deployment Strategies, and SLOs.

Preferred Qualifications

  • Familiarity with best practices for operating a large-scale, highly available enterprise platform.

  • 5+ years of experience in a related role.

  • Proven ability to build and implement scalable and well-tested solutions.

  • Excellent communication and documentation skills.

  • Strong sense of ownership, drive, and attention to detail.

Why Cisco? 

At Cisco, we’re revolutionizing how data and infrastructure connect and protect organizations in the AI era – and beyond. We’ve been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds. These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint.

Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you’ll see that the opportunities to grow and build are limitless. We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere. 

We are Cisco, and our power starts with you. 

Locations

  • London, United Kingdom

Salary

75,200 - 103,200 GBP per year

Skills Required

  • Kubernetes and its ecosystem (intermediate)
  • Python or Go (intermediate)
  • AWS (intermediate)
  • Web technologies (intermediate)
  • Networking (intermediate)
  • Linux (intermediate)
  • Site Reliability principles: Incident Response, Change Management, Distributed Systems, Deployment Strategies, SLOs (intermediate)

Benefits

  • Unparalleled security, visibility, and insights across the entire digital footprint.
  • Worldwide network of doers and experts.
  • Limitless opportunities to grow and build.
  • Collaborative team environment.
  • Global impact.
