Rate: Negotiable | IR35: Inside | Working Arrangements: Remote | Location: London
Summary: The Senior Site Reliability Engineer (SRE) role involves ensuring the reliability of high-traffic platforms in the video game industry. The position focuses on improving architecture, platform resiliency, and service performance while leading incident response and mentoring teams. This is a remote, 12-month contract with a high chance of extension. The role requires extensive experience in AWS and Kubernetes, among other technical skills.
Salary (Rate): undetermined
City: London
Country: UK
Working Arrangements: remote
IR35 Status: inside IR35
Seniority Level: undetermined
Industry: IT
Detailed Description From Employer:
Senior Site Reliability Engineer (SRE)
Remote
12-month contract (high chance of extension)
Job Description
Join a global pioneer in the video game industry and own the reliability of high-traffic, revenue-critical platforms used by millions worldwide. As a Senior SRE, you'll shape the architecture, improve platform-wide resiliency, and ensure services stay performant, scalable, and secure. This isn't just about maintaining a single system: you'll influence reliability across multiple services, driving improvements that touch the entire ecosystem.
Key Responsibilities
- Lead incident response and troubleshooting for production systems, resolving high-severity issues and driving post-incident improvements.
- Influence architecture to improve platform-wide reliability, resiliency, and operational efficiency, ensuring services remain available under heavy load.
- Drive containerisation best practices and manage Kubernetes-based workloads at scale.
- Build and maintain event-driven architectures that scale globally while ensuring fault-tolerance and high availability.
- Automate infrastructure provisioning, deployment, and monitoring using Infrastructure as Code (Terraform, CloudFormation, Ansible, CDK).
- Collaborate with engineering, product, and security teams to define SLOs, SLIs, and error budgets across services.
- Provide mentorship, advocate SRE best practices, and ensure teams are empowered to deliver resilient, reliable systems.
Experience / Must-Have Skills
- Extensive experience in AWS and AWS-managed services (EC2, Lambda, S3, VPC, CloudWatch, CloudTrail, IAM, EKS, Service Catalog, multi-account environments).
- Strong Kubernetes / container orchestration experience, including EKS, OpenShift, Docker, and service mesh.
- Deep understanding of networking fundamentals: DNS, VPCs, routing, load balancing, TCP/IP, firewall policies.
- Proven track record in incident response and troubleshooting at scale.
- Hands-on experience with infrastructure automation and CI/CD pipelines.
- Experience designing event-driven architectures and resilient systems.
- High level of autonomy, able to influence platform-wide decisions and architect for reliability across services.
- Ability and desire to mentor junior staff.
- Bonus: experience in gaming, interactive entertainment, or other high-traffic, global-scale platforms.
If you are interested in this role, please feel free to submit your CV.