Site Reliability Engineer

Posted 4 days ago by Manchester Digital

Negotiable

Undetermined

Hybrid

Manchester, England, United Kingdom

Ansible, Application Programming Interface (API), Automation, Generic Programming, Go (Programming Language), golang, Grafana, Information Technology Operations, Infrastructure as Code (IaC), JavaScript (Programming Language), Life Cycle Planning, Management, Management Effectiveness, Market Trend, New Relic (SaaS), Operational Efficiency, Python (Programming Language), Scripting, Service Level Objectives, Site Reliability Engineering, Software Development, Software Development Life Cycle, Software Engineering, Splunk, Terraform, Uptime

Summary: The Site Reliability Engineer will leverage software engineering skills to enhance system reliability and observability, directly impacting operational efficiency. This role involves monitoring critical systems, implementing solutions for improved reliability, and collaborating across functions to integrate best practices into the software development life cycle. The engineer will also support governance standards and foster a culture of reliability within the organization. This position is eligible for hybrid working arrangements.

Key Responsibilities:

Monitor the health, performance, and availability of critical systems.
Implement solutions to enhance reliability, including service instrumentation and improved logging practices.
Develop features for maintainability and engineer tools for effective service management.
Collaborate across multiple functions to integrate reliability and observability best practices.
Support governance standards set by central teams.
Write and contribute to code that enhances reliability and observability of services.
Develop and maintain tools for effective system management.
Automate manual activities using orchestration platforms.
Build dashboards using telemetry data and dashboarding technologies.
Maintain and administer existing monitoring and analytic toolsets.
Mentor colleagues in new technologies or practices.
Participate in live incident resolution and post-mortem analysis.
Drive initiatives to enhance system reliability and observability.
Collaborate with Site Reliability Engineering and Observability teams to uphold standards.
Work with IT Operations to support critical tooling for business value.

Key Skills:

Excellent knowledge of Site Reliability Engineering principles.
Knowledge of observability tools and best practices (e.g., Splunk, New Relic, Grafana, PagerDuty).
Proficiency in programming languages (Python, Golang, JavaScript).
Experience with modern software development techniques and lifecycles.
Experience with Infrastructure as Code (IaC) tools (Ansible, Terraform).
Prior experience in a large-scale, 24/7 enterprise environment.
Keen interest in industry trends, particularly Platform Engineering.
Proficiency in shell scripting for automation tasks.

Salary (Rate): undetermined

City: Manchester

Country: United Kingdom

Working Arrangements: hybrid

IR35 Status: undetermined

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

You will have software engineering skills, focusing on system reliability and observability. You will monitor the health, performance and availability of critical systems, directly impacting operational efficiency. Using your engineering expertise, you will implement solutions that enhance reliability, including service instrumentation with tools such as OpenTelemetry and improved logging practices, and develop features for maintainability. You will also help engineer tools and automation for effective service management. Collaboration is key: you will work across multiple functions to integrate reliability and observability best practices into the software development life cycle. By supporting governance standards set by the central teams, you will foster a culture where these principles are integral to development. Your contributions will ensure our systems meet user demands and enhance overall service performance. This role is eligible for inclusion in the Company’s hybrid working from home policy.

Preferred Skills And Experience

Excellent knowledge of Site Reliability Engineering principles, including the creation and management of effective Service Level Indicators (SLI) and Service Level Objectives (SLO) for reliability and customer satisfaction.
Knowledge of contemporary observability tools, techniques and best practice, including Splunk, New Relic, Grafana and PagerDuty.
Excellent knowledge of programming languages including Python, Golang and JavaScript.
Knowledge and experience of modern software development techniques and lifecycles.
Experience with Infrastructure as Code (IaC) automation and orchestration tools such as Ansible and Terraform.
Prior experience working in a large-scale, 24/7 enterprise where system uptime and stability are of paramount importance to the Business.
Keen interest in industry trends, particularly Platform Engineering.
Proficiency in shell scripting for automation and system management tasks.

What you will be doing

Writing and contributing to code that enhances the reliability and observability of services, including telemetry, operational APIs and tooling.
Developing and maintaining tools that facilitate effective management of our systems, ensuring they are operationally efficient and resilient.
Working with automation and orchestration platforms to automate manual activity and reduce toil.
Building sophisticated dashboards using a range of telemetry data and dashboarding technologies like Grafana, Splunk and New Relic.
Maintaining and administering existing monitoring and analytic toolsets.
Mentoring colleagues in the use of new technologies or practices.
Actively participating in live incident resolution and post-mortem analysis, providing effective remediation strategies to improve overall system health and prevent future issues.
Driving initiatives to enhance system reliability and observability, contributing to a culture of continuous improvement.
Collaborating with the central Site Reliability Engineering and Observability teams to establish and uphold standards for reliability and observability, assisting teams in adhering to these practices.
Working with IT Operations, providing and supporting the use of critical tooling to enable increasing levels of value to the Business.
