Senior Site Reliability Engineer

Posted 1 week ago by SixteenFifty

Negotiable
London Area, United Kingdom

Summary: The role of Contract Site Reliability Engineer Team Lead involves providing technical leadership and mentorship to a team of Site Reliability Engineers (SREs) while remaining hands-on with cloud infrastructure and operational excellence. The position requires a strong technical background in AWS and a focus on driving reliability and performance improvements across critical production systems. Occasional travel to the client's offices in Central London is expected. The ideal candidate will champion Infrastructure as Code practices and collaborate closely with various teams to enhance system reliability.

City: London

Country: United Kingdom

Industry: IT

Detailed Description From Employer:

My client is seeking an experienced Contract Site Reliability Engineer Team Lead to join their team. This is a hands-on leadership role requiring a strong technical background alongside the ability to lead, mentor, and develop a high-performing SRE function. There will be occasional travel to the client's offices in Central London.

Responsibilities

  • Provide technical leadership for the Site Reliability Engineering function, driving reliability, scalability, and performance improvements across critical production systems.
  • Lead, mentor, and coach a team of SREs and engineers, fostering a culture of operational excellence, collaboration, and continuous improvement.
  • Remain hands-on with the design, implementation, and support of cloud infrastructure, automation, observability, and platform reliability initiatives.
  • Define, implement, and govern SLOs, SLIs, and error budgets, ensuring alignment between engineering priorities and business objectives.
  • Architect, maintain, and optimise highly available, distributed systems within an AWS cloud environment.
  • Drive change management initiatives across infrastructure, platforms, and operational processes, ensuring smooth adoption of new technologies and ways of working.
  • Champion Infrastructure as Code (IaC) and automation practices, reducing manual operational effort through tools such as Terraform and CloudFormation.
  • Collaborate closely with development, platform, and operational teams to embed reliability and resilience best practices throughout the software development lifecycle.
  • Lead incident management, root cause analysis, and continuous service improvement activities.
  • Establish and enhance monitoring, alerting, and observability capabilities across the technology estate.

Required Skills & Experience

  • Proven experience in a Site Reliability Engineering, DevOps, Cloud Engineering, or Infrastructure Engineering role, with experience leading or mentoring technical teams.
  • Demonstrable hands-on technical expertise alongside leadership responsibilities.
  • Strong experience delivering and managing change within complex technology environments.
  • Extensive experience working with AWS cloud services and architectures, as the client's platform is hosted within AWS.
  • Strong Linux/Unix systems administration knowledge.
  • Proficiency in one or more scripting or programming languages such as Python, Bash, Go, or Java.
  • Strong experience with Infrastructure as Code tools, including Terraform and/or CloudFormation.
  • Experience with containerisation and orchestration technologies, including Docker and Kubernetes.
  • Familiarity with CI/CD tooling such as Jenkins, GitHub Actions, GitLab CI, or Azure DevOps.
  • Experience with observability and monitoring platforms, including Datadog and Splunk, is essential.
  • Strong understanding of distributed systems, networking, security principles, and cloud-native architectures.
  • Excellent troubleshooting, problem-solving, and stakeholder management skills.

Desirable Experience

  • Operating within large-scale, mission-critical production environments.
  • Previous experience establishing or maturing SRE practices and operating models.
  • Relevant AWS, Kubernetes, or cloud certifications.

Please apply for immediate consideration.