Summary: The Site Reliability Engineer - Manager role leads the design, development, and delivery of scalable, reliable infrastructure and services for one of the top companies in the mobile industry. The position involves working with cross-functional teams to enhance observability, automate operations, and build robust systems for cloud-native applications. The role is hybrid, based in London, and focuses on maintaining critical services while optimising performance and deployment pipelines. Candidates should have at least five years of hands-on site reliability engineering experience and strong skills in Kubernetes, AWS, Terraform, and Python, Go, or Rust.
Salary (Rate): £300 to £330 per day
City: London
Country: United Kingdom
Working Arrangements: hybrid
IR35 Status: outside IR35
Seniority Level: undetermined
Industry: IT
Job Title: Site Reliability Engineer - Manager
Location: Hybrid Remote – London EC2M
Contract: 12 months
Rate: Outside IR35 - £300 to £330 Per Day
About the Role: We are partnering with one of the top companies in the mobile industry to hire a Site Reliability Engineer (SRE) Manager. In this role, you will collaborate with cross-functional teams to drive the design, development, and delivery of high-performing, scalable, and reliable infrastructure and services. You’ll be responsible for building robust systems, automating operations, and enhancing observability and deployment pipelines for modern cloud-native applications.
Key Responsibilities:
- System Reliability & Performance: Maintain and scale critical services and infrastructure. Identify performance bottlenecks and work closely with product engineers to optimise applications.
- Kubernetes Operations: Administer, scale, and troubleshoot clusters in GKE, EKS, or other Kubernetes environments.
- Infrastructure as Code (IaC): Design and maintain scalable infrastructure using Terraform and automate deployments across public, private, or hybrid clouds (mainly AWS).
- CI/CD Pipeline Enhancement: Build and improve robust CI/CD pipelines to support fast and safe deployment cycles.
- Observability & Monitoring: Implement code-based instrumentation and telemetry. Ensure systems are observable with tools for logging, metrics, and alerting.
- Automation & Scripting: Write tooling and automation scripts in Python, Go, or Rust to reduce toil and manual intervention.
- Storage & Networking: Manage and optimise storage services like Amazon S3 or Google Cloud Storage (GCS). Resolve complex networking issues in multi-cloud environments.
Essential Requirements:
- 5+ years of hands-on experience as a Site Reliability Engineer.
- Proven expertise in Kubernetes (GKE/EKS).
- Strong proficiency in Python, Go, or Rust.
- Solid experience with AWS and Infrastructure as Code using Terraform.
- Deep understanding of Linux internals, standard networking protocols, and distributed systems architecture.
- Hands-on experience with automation and performance optimisation.
- Strong knowledge of SRE principles and methodologies.
- Experience with observability tools and telemetry systems.
- Exposure to Google Cloud Platform (GCP).
- Familiarity with hybrid or multi-cloud architecture.
- Experience with service meshes or edge proxies (e.g., Envoy, Istio).
- Working knowledge of container security best practices.