Senior Site Reliability Engineer (SRE)

Posted Today by 1773918608

Negotiable
Inside
Remote
London

Summary: The Senior Site Reliability Engineer (SRE) role involves ensuring the reliability of high-traffic platforms in the video game industry. The position focuses on improving architecture, platform resiliency, and service performance while leading incident response and mentoring teams. This is a remote, 12-month contract with a high chance of extension. The role requires extensive experience in AWS and Kubernetes, among other technical skills.

Key Responsibilities:

  • Lead incident response and troubleshooting for production systems, resolving high-severity issues and driving post-incident improvements.
  • Influence architecture to improve platform-wide reliability, resiliency, and operational efficiency, ensuring services remain available under heavy load.
  • Drive containerisation best practices and manage Kubernetes-based workloads at scale.
  • Build and maintain event-driven architectures that scale globally while ensuring fault tolerance and high availability.
  • Automate infrastructure provisioning, deployment, and monitoring using Infrastructure as Code (Terraform, CloudFormation, Ansible, CDK).
  • Collaborate with engineering, product, and security teams to define SLOs, SLIs, and error budgets across services.
  • Provide mentorship, advocate SRE best practices, and ensure teams are empowered to deliver resilient, reliable systems.

Key Skills:

  • Extensive experience in AWS and AWS-managed services (EC2, Lambda, S3, VPC, CloudWatch, CloudTrail, IAM, EKS, Service Catalog, multi-account environments).
  • Strong Kubernetes / container orchestration experience, including EKS, OpenShift, Docker, and service mesh.
  • Deep understanding of networking fundamentals: DNS, VPCs, routing, load balancing, TCP/IP, firewall policies.
  • Proven track record in incident response and troubleshooting at scale.
  • Hands-on experience with infrastructure automation and CI/CD pipelines.
  • Experience designing event-driven architectures and resilient systems.
  • High level of autonomy, able to influence platform-wide decisions and architect for reliability across services.
  • Ability and desire to mentor junior staff.
  • Bonus: experience in gaming, interactive entertainment, or other high-traffic, global-scale platforms.

Salary (Rate): undetermined

City: London

Country: UK

Working Arrangements: remote

IR35 Status: inside IR35

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

Senior Site Reliability Engineer (SRE)
Remote

12-month contract (high chance of extension)

Job Description
Join a global pioneer in the video game industry and own the reliability of high-traffic, revenue-critical platforms used by millions worldwide. As a Senior SRE, you'll shape the architecture, improve platform-wide resiliency, and ensure services stay performant, scalable, and secure. This isn't just about maintaining a single system: you'll influence reliability across multiple services, driving improvements that touch the entire ecosystem.

Key Responsibilities

  • Lead incident response and troubleshooting for production systems, resolving high-severity issues and driving post-incident improvements.
  • Influence architecture to improve platform-wide reliability, resiliency, and operational efficiency, ensuring services remain available under heavy load.
  • Drive containerisation best practices and manage Kubernetes-based workloads at scale.
  • Build and maintain event-driven architectures that scale globally while ensuring fault tolerance and high availability.
  • Automate infrastructure provisioning, deployment, and monitoring using Infrastructure as Code (Terraform, CloudFormation, Ansible, CDK).
  • Collaborate with engineering, product, and security teams to define SLOs, SLIs, and error budgets across services.
  • Provide mentorship, advocate SRE best practices, and ensure teams are empowered to deliver resilient, reliable systems.

Experience / Must-Have Skills

  • Extensive experience in AWS and AWS-managed services (EC2, Lambda, S3, VPC, CloudWatch, CloudTrail, IAM, EKS, Service Catalog, multi-account environments).
  • Strong Kubernetes / container orchestration experience, including EKS, OpenShift, Docker, and service mesh.
  • Deep understanding of networking fundamentals: DNS, VPCs, routing, load balancing, TCP/IP, firewall policies.
  • Proven track record in incident response and troubleshooting at scale.
  • Hands-on experience with infrastructure automation and CI/CD pipelines.
  • Experience designing event-driven architectures and resilient systems.
  • High level of autonomy, able to influence platform-wide decisions and architect for reliability across services.
  • Ability and desire to mentor junior staff.
  • Bonus: experience in gaming, interactive entertainment, or other high-traffic, global-scale platforms.

If you are interested in this role, please feel free to submit your CV.