Negotiable
Undetermined
Hybrid
Manchester, England, United Kingdom
Summary: The Site Reliability Engineer will leverage software engineering skills to enhance system reliability and observability, directly impacting operational efficiency. This role involves monitoring critical systems, implementing solutions for improved reliability, and collaborating across functions to integrate best practices into the software development life cycle. The engineer will also support governance standards and foster a culture of reliability within the organization. This position is eligible for hybrid working arrangements.
Key Responsibilities:
- Monitor the health, performance, and availability of critical systems.
- Implement solutions to enhance reliability, including service instrumentation and improved logging practices.
- Develop features for maintainability and engineer tools for effective service management.
- Collaborate across multiple functions to integrate reliability and observability best practices.
- Support governance standards set by central teams.
- Write and contribute to code that enhances reliability and observability of services.
- Develop and maintain tools for effective system management.
- Automate manual activities using orchestration platforms.
- Build dashboards using telemetry data and dashboarding technologies.
- Maintain and administer existing monitoring and analytic toolsets.
- Mentor colleagues in new technologies or practices.
- Participate in live incident resolution and post-mortem analysis.
- Drive initiatives to enhance system reliability and observability.
- Collaborate with Site Reliability Engineering and Observability teams to uphold standards.
- Work with IT Operations to provide and support critical tooling that delivers business value.
Key Skills:
- Excellent knowledge of Site Reliability Engineering principles.
- Knowledge of observability tools and best practices (e.g., Splunk, New Relic, Grafana, PagerDuty).
- Proficiency in programming languages (Python, Golang, JavaScript).
- Experience with modern software development techniques and lifecycles.
- Experience with Infrastructure as Code (IaC) tools (Ansible, Terraform).
- Prior experience in a large-scale, 24/7 enterprise environment.
- Keen interest in industry trends, particularly Platform Engineering.
- Proficiency in shell scripting for automation tasks.
Salary (Rate): undetermined
City: Manchester
Country: United Kingdom
Working Arrangements: hybrid
IR35 Status: undetermined
Seniority Level: undetermined
Industry: IT
You will apply software engineering skills to system reliability and observability. You will monitor the health, performance and availability of critical systems, directly impacting operational efficiency. Using your engineering expertise, you will implement solutions that enhance reliability, including instrumenting services with tools such as OpenTelemetry, improving logging practices and developing features for maintainability. You will also help engineer tools and automation for effective service management. Collaboration is key: you will work across multiple functions to integrate reliability and observability best practices into the software development life cycle. By supporting governance standards set by the central teams, you will foster a culture where these principles are integral to development. Your contributions will ensure our systems meet user demands and enhance overall service performance. This role is eligible for inclusion in the Company’s hybrid working from home policy.
Preferred Skills And Experience
- Excellent knowledge of Site Reliability Engineering principles, including the creation and management of effective Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for reliability and customer satisfaction.
- Knowledge of contemporary observability tools, techniques and best practices, including Splunk, New Relic, Grafana and PagerDuty.
- Excellent knowledge of programming languages including Python, Golang and JavaScript.
- Knowledge and experience of modern software development techniques and lifecycles.
- Experience with Infrastructure as Code (IaC) automation and orchestration tools such as Ansible and Terraform.
- Prior experience working in a large-scale, 24/7 enterprise where system uptime and stability are of paramount importance to the business.
- Keen interest in industry trends, particularly Platform Engineering.
- Proficiency in shell scripting for automation and system management tasks.
What you will be doing
- Writing and contributing to code that enhances the reliability and observability of services, including telemetry, operational APIs and tooling.
- Developing and maintaining tools that facilitate effective management of our systems, ensuring they are operationally efficient and resilient.
- Working with automation and orchestration platforms to automate manual activity and reduce toil.
- Building sophisticated dashboards using a range of telemetry data and dashboarding technologies such as Grafana, Splunk and New Relic.
- Maintaining and administering existing monitoring and analytic toolsets.
- Mentoring colleagues in the use of new technologies and practices.
- Actively participating in live incident resolution and post-mortem analysis, providing effective remediation strategies to improve overall system health and prevent future issues.
- Driving initiatives to enhance system reliability and observability, contributing to a culture of continuous improvement.
- Collaborating with the central Site Reliability Engineering and Observability teams to establish and uphold standards for reliability and observability, assisting teams in adhering to these practices.
- Working with IT Operations to provide and support critical tooling that delivers increasing value to the business.