Senior Site Reliability Engineer

Posted 1 day ago by Haystack

Negotiable
Undetermined
Remote
London, England, United Kingdom

Summary: The Senior Site Reliability Engineer will own reliability and resiliency across a large AWS ecosystem for a global video game company. The role involves leading high-severity incident response, architecting Kubernetes workloads, and automating infrastructure provisioning. The engineer will also define service level objectives in collaboration with product and security teams. This is a long-term contract with fully remote working within the UK.

Key Responsibilities:

  • Lead high-severity incident response and troubleshooting for production systems.
  • Architect and manage Kubernetes-based workloads at scale using EKS and OpenShift.
  • Build and maintain robust event-driven architectures that scale globally.
  • Automate infrastructure provisioning and CI/CD pipelines using Infrastructure as Code tools.
  • Define and manage SLOs, SLIs, and error budgets in collaboration with product and security teams.

Key Skills:

  • Extensive hands-on experience with AWS managed services including EC2, Lambda, S3, VPC, CloudWatch, and multi-account IAM environments.
  • Deep expertise in Kubernetes (EKS), Docker, and service mesh technologies.
  • Strong command of networking fundamentals: DNS, VPC routing, load balancing, TCP/IP, and advanced firewall policies.
  • Proven track record of automating deployments and managing infrastructure through code (Terraform/Ansible).
  • Leadership mindset with the ability to mentor junior engineers.

Salary (Rate): undetermined

City: London

Country: United Kingdom

Working Arrangements: remote

IR35 Status: undetermined

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

We're working with a global pioneer in the video game industry on this exciting opportunity. Scale the systems behind revenue-critical platforms used by millions of gamers worldwide. We are looking for an SRE leader to own reliability and resiliency across a massive AWS ecosystem, driving high-performance engineering in a high-traffic, global-scale environment.

The Role

  • Lead high-severity incident response and troubleshooting for production systems, driving post-mortem improvements and long-term platform stability.
  • Architect and manage Kubernetes-based workloads at scale using EKS and OpenShift to ensure container orchestration excellence.
  • Build and maintain robust event-driven architectures that scale globally while maintaining fault-tolerance and high availability.
  • Automate infrastructure provisioning and CI/CD pipelines using Infrastructure as Code tools including Terraform, CloudFormation, Ansible, and CDK.
  • Define and manage SLOs, SLIs, and error budgets in collaboration with product and security teams to maintain world-class service levels.

What You'll Need

  • Extensive hands-on experience with AWS managed services including EC2, Lambda, S3, VPC, CloudWatch, and multi-account IAM environments.
  • Deep expertise in Kubernetes (EKS), Docker, and service mesh technologies for managing complex microservices architectures.
  • Strong command of networking fundamentals: DNS, VPC routing, load balancing, TCP/IP, and advanced firewall policies.
  • Proven track record of automating deployments and managing infrastructure through code (Terraform/Ansible) in high-traffic environments.
  • Leadership mindset with the ability to mentor junior engineers and influence platform-wide architectural decisions.

What's On Offer

  • Long-term, 12-month contract with a high probability of extension based on performance.
  • 100% remote working within the UK.
  • The chance to work on world-class gaming infrastructure that supports millions of concurrent users globally.

Apply via Haystack today!