£62 Per hour
Inside IR35
Remote
London, England, United Kingdom
Summary: The Site Reliability Engineer (SRE) role involves architecting and maintaining high-performance observability platforms for a global technology company. The position requires expertise in distributed systems and automation tools to ensure seamless performance across millions of connected devices. The role offers 100% remote working flexibility and an initial 12-month contract with potential for extension. Candidates should have extensive experience in Site Reliability Engineering or DevOps within enterprise-scale cloud environments.
Key Responsibilities:
- Design, deploy, and scale high-performance observability platforms and Prometheus monitoring systems.
- Architect and maintain massive Elasticsearch clusters and robust data pipelines leveraging Kafka.
- Drive "Infrastructure as Code" (IaC) initiatives by automating complex cloud environments using Terraform and Ansible.
- Build custom internal tools and sophisticated automation scripts using Python, Go, or Ruby.
- Optimize Linux systems (Debian/Ubuntu) and participate in a collaborative on-call rotation.
Key Skills:
- 5+ years of experience in Site Reliability Engineering (SRE) or DevOps.
- Mastery of the Observability stack, specifically Prometheus, Grafana, and the full ELK Stack.
- Expert-level Linux systems administration skills.
- Deep knowledge of distributed systems architecture and Kafka messaging.
- Hands-on proficiency with automation and configuration tools, including Terraform and Ansible.
- Programming skills in Python or Golang.
- Ability to thrive in a fast-paced environment.
Salary (Rate): £62 hourly
City: London
Country: United Kingdom
Working Arrangements: remote
IR35 Status: inside IR35
Seniority Level: Senior
Industry: IT
SRE - Site Reliability Engineer | £55 - £62 per hour
We're working with a global technology powerhouse that supports millions of connected devices. Step into a high-impact Senior SRE role where you will be the architect of reliability for a massive distributed systems landscape. You will take the lead on scaling mission-critical observability and monitoring platforms, using a stack that includes Prometheus, Kafka, and the ELK Stack, to ensure seamless performance for a global user base.
The Role
- Design, deploy, and scale high-performance observability platforms and Prometheus monitoring systems to support millions of global devices.
- Architect and maintain massive Elasticsearch clusters and robust data pipelines leveraging Kafka for real-time streaming.
- Drive "Infrastructure as Code" (IaC) initiatives by automating complex cloud environments using Terraform and Ansible.
- Build custom internal tools and sophisticated automation scripts using Python, Go, or Ruby to eliminate toil and boost system performance.
- Optimize Linux systems (Debian/Ubuntu) and participate in a collaborative on-call rotation to maintain 24/7 service availability.
What You'll Need
- 5+ years of battle-tested experience in Site Reliability Engineering (SRE) or DevOps within enterprise-scale cloud environments.
- Mastery of the Observability stack, specifically Prometheus, Grafana, and the full ELK Stack (Elasticsearch, Logstash, Kibana).
- Expert-level Linux systems administration skills and deep knowledge of distributed systems architecture and Kafka messaging.
- Hands-on proficiency with automation and configuration tools, including Terraform, Ansible, and programming in Python or Golang.
- The ability to thrive in a fast-paced environment, tackling complex scaling challenges for high-traffic cloud services.
What's On Offer
- Competitive hourly rate of £55 - £62 (Inside IR35).
- Long-term stability with an initial 12-month contract and high potential for extension.
- 100% remote working flexibility while supporting a premier London-based technology hub.
- Opportunity to work on a truly global scale, impacting the experience of millions of daily active users.
Apply via Haystack today!