Negotiable
Undetermined
Remote
London, England, United Kingdom
Summary: The role of Senior Site Reliability Engineer involves leading the reliability and resiliency of a vast AWS ecosystem for a global video game company. The position requires managing high-severity incidents, architecting Kubernetes workloads, and automating infrastructure provisioning. The engineer will also define service level objectives in collaboration with product and security teams. This is a long-term contract with remote working flexibility within the UK.
Key Responsibilities:
- Lead high-severity incident response and troubleshooting for production systems.
- Architect and manage Kubernetes-based workloads at scale using EKS and OpenShift.
- Build and maintain robust event-driven architectures that scale globally.
- Automate infrastructure provisioning and CI/CD pipelines using Infrastructure as Code tools.
- Define and manage SLOs, SLIs, and error budgets in collaboration with product and security teams.
Key Skills:
- Extensive hands-on experience with AWS managed services including EC2, Lambda, S3, VPC, CloudWatch, and multi-account IAM environments.
- Deep expertise in Kubernetes (EKS), Docker, and service mesh technologies.
- Strong command of networking fundamentals: DNS, VPC routing, load balancing, TCP/IP, and advanced firewall policies.
- Proven track record of automating deployments and managing infrastructure through code (Terraform/Ansible).
- Leadership mindset with the ability to mentor junior engineers.
Salary (Rate): undetermined
City: London
Country: United Kingdom
Working Arrangements: remote
IR35 Status: undetermined
Seniority Level: undetermined
Industry: IT
We're working with a global pioneer in the video game industry on this exciting opportunity. Scale the systems behind revenue-critical platforms used by millions of gamers worldwide. We are looking for an SRE leader to own reliability and resiliency across a massive AWS ecosystem, driving high-performance engineering in a high-traffic, global-scale environment.
The Role
- Lead high-severity incident response and troubleshooting for production systems, driving post-mortem improvements and long-term platform stability.
- Architect and manage Kubernetes-based workloads at scale using EKS and OpenShift to ensure container orchestration excellence.
- Build and maintain robust event-driven architectures that scale globally while maintaining fault-tolerance and high availability.
- Automate infrastructure provisioning and CI/CD pipelines using Infrastructure as Code tools including Terraform, CloudFormation, Ansible, and CDK.
- Define and manage SLOs, SLIs, and error budgets in collaboration with product and security teams to maintain world-class service levels.
What You'll Need
- Extensive hands-on experience with AWS managed services including EC2, Lambda, S3, VPC, CloudWatch, and multi-account IAM environments.
- Deep expertise in Kubernetes (EKS), Docker, and service mesh technologies for managing complex microservices architectures.
- Strong command of networking fundamentals: DNS, VPC routing, load balancing, TCP/IP, and advanced firewall policies.
- Proven track record of automating deployments and managing infrastructure through code (Terraform/Ansible) in high-traffic environments.
- Leadership mindset with the ability to mentor junior engineers and influence platform-wide architectural decisions.
What's On Offer
- Long-term 12-month contract with a high probability of extension based on performance.
- Fully remote working from anywhere within the UK.
- The chance to work on world-class gaming infrastructure that supports millions of concurrent users globally.
Apply via Haystack today!