Site Reliability Engineer

Posted 2 weeks ago by Insight International (UK) Ltd

Negotiable

Undetermined

Onsite

London Area, United Kingdom

Summary: The Site Reliability Engineer (SRE) role in London focuses on automation, optimization, and process re-engineering for the Market Risk Platform, emphasizing the use of AI. The position aims to eliminate operational toil and enhance reliability operations, allowing existing SREs to concentrate on engineering rather than firefighting. Success is measured by the reduction of manual steps and improved recovery times. Candidates should possess strong Python skills and experience in agentic AI delivery.

Key Responsibilities:

Build production-grade automation in Python to remove repetitive work.
Create self-service capabilities for common requests.
Implement “automation with safety”, including idempotency and rollback strategies.
Map and redesign current operational processes to reduce waste and cycle time.
Standardize runbooks/playbooks into executable workflows.
Define and track operational KPIs related to toil and alert volume.
Design and implement agentic workflows for diagnostics and remediation.
Put strong controls in place for risky actions and productionize with monitoring.

Key Skills:

Senior SRE experience on distributed systems and batch/intraday workloads.
Strong Python programming skills.
Experience with agentic AI and tool integration.
Demonstrated process optimization abilities.
Strong Linux and troubleshooting fundamentals.
Experience with mixed estates including VMs and Cloud, with Kubernetes exposure.
Exposure to Banking/Finance Market Risk Domains.
Familiarity with the Athena ecosystem or similar.

Salary (Rate): undetermined

City: London

Country: United Kingdom

Working Arrangements: on-site

IR35 Status: undetermined

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

Site Reliability Engineer

London, UK

Onsite 5 days

SRE Role description

We need an experienced SRE to focus predominantly on automation, optimization, and process re-engineering using AI for the Market Risk Platform. Success is measured by capacity created (toil eliminated, fewer manual steps, faster recovery, safer/faster changes), not by being the primary BAU support resource. Strong Python and provable agentic AI delivery are required.

Primary Objectives:

Eliminate operational toil and recurring manual work through durable automation
Re-engineer support/change processes to reduce handoffs, approval friction and rerun complexity
Industrialize reliability operations so existing SREs spend less time firefighting and more time engineering

Key Responsibilities (Automation & Process first)

Automation Engineering (Core)

Build production-grade automation in Python (tools, services, workflows) to remove repetitive work: environment checks, dependency validation, automated reruns/reprocessing, safe restarts, drift detection, remediation actions, and standardized operational tasks
Create self-service capabilities for common requests (guardrailed, auditable, repeatable)
Implement “automation with Safety”: idempotency, dry-run modes, approval gates where needed, rollback/undo strategies, and clear audit trails
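For illustration only, the "automation with safety" properties listed above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the employer's tooling: the names `SafeAction`, `Runner`, and `deny_all` are hypothetical, and real checks, approvals, and audit storage would be platform-specific.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def deny_all(action_name: str) -> bool:
    """Hypothetical default approver: refuse every risky action."""
    return False


@dataclass
class SafeAction:
    """One automation step (all callables are illustrative placeholders)."""
    name: str
    is_done: Callable[[], bool]   # idempotency check: desired state already holds?
    apply: Callable[[], None]     # performs the change
    undo: Callable[[], None]      # reverts the change
    needs_approval: bool = False  # approval gate for risky steps


@dataclass
class Runner:
    dry_run: bool = True                          # default to "report only"
    approver: Callable[[str], bool] = deny_all
    audit: List[str] = field(default_factory=list)

    def run(self, actions: List[SafeAction]) -> bool:
        applied: List[SafeAction] = []
        for action in actions:
            if action.is_done():
                self.audit.append(f"skip {action.name}: already in desired state")
                continue
            if self.dry_run:
                self.audit.append(f"dry-run {action.name}: would apply")
                continue
            if action.needs_approval and not self.approver(action.name):
                self.audit.append(f"blocked {action.name}: approval denied")
                self._rollback(applied)
                return False
            try:
                action.apply()
                applied.append(action)
                self.audit.append(f"applied {action.name}")
            except Exception as exc:
                self.audit.append(f"failed {action.name}: {exc}")
                self._rollback(applied)
                return False
        return True

    def _rollback(self, applied: List[SafeAction]) -> None:
        # Undo in reverse order so later changes are reverted first.
        for action in reversed(applied):
            action.undo()
            self.audit.append(f"rolled back {action.name}")
```

Running the same action list twice changes nothing the second time, because each step is skipped once its `is_done` check passes; the `audit` list stands in for a durable audit trail.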

Process Re-engineering (Core)

Map current operational processes (incident/problem/change, release readiness, rerun/recovery, access/entitlements, environment onboarding) and redesign them to remove waste and reduce cycle time.
Standardize runbooks/playbooks into executable workflows; reduce tribal knowledge via templates, checklists, and automated pre-flight controls.
Define and track operational KPIs (toil hours removed, alert volume reduction, MTTR improvements, change failure rate reduction, rerun time reduction).
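As a rough illustration of how two of the KPIs above might be computed, here is a minimal Python sketch. The function names and the input shape (pairs of detection and resolution timestamps) are assumptions for the example, not something the posting specifies.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mttr(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recovery over (detected_at, resolved_at) pairs."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction from a baseline, e.g. toil hours or alert counts."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before * 100.0
```

For example, a drop from 40 toil hours per week to 10 gives `pct_reduction(40, 10) == 75.0`.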

Agentic AI

Design and implement agentic workflows that take action using tools/runbooks (e.g., diagnostics, evidence gathering, correlation, guided remediation, change-risk checks, automated rerun orchestration)
Put strong controls in place: scoped permissions, deterministic fallbacks, human-in-the-loop approvals for risky actions, evaluation harnesses and measurable outcomes.
Productionize with monitoring, logging and post-incident learnings feeding back into the agent/tooling
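The controls described above (scoped permissions, human-in-the-loop approval, deterministic fallback, logging) could look something like the following Python sketch. Everything here is hypothetical: `ToolGate`, the risk labels, and the `escalate_to_on_call` fallback are illustrative names, and no particular agent framework is assumed.

```python
from typing import Callable, Dict, List, Set, Tuple

READ_ONLY = "read"
MUTATING = "mutate"
FALLBACK = "escalate_to_on_call"  # deterministic fallback when the agent is refused


class ToolGate:
    """Sits between an agent and its tools and enforces the controls."""

    def __init__(self, allowed: Set[str], approve: Callable[[str, dict], bool]):
        self.allowed = allowed      # scoped permissions for this agent
        self.approve = approve      # human-in-the-loop decision hook
        self.tools: Dict[str, Tuple[str, Callable]] = {}
        self.log: List[Tuple[str, str]] = []  # stand-in for real logging

    def register(self, name: str, risk: str, fn: Callable) -> None:
        self.tools[name] = (risk, fn)

    def call(self, name: str, **kwargs) -> dict:
        if name not in self.tools or name not in self.allowed:
            self.log.append(("denied", name))
            return {"status": "denied", "fallback": FALLBACK}
        risk, fn = self.tools[name]
        # Mutating tools only run after explicit approval.
        if risk == MUTATING and not self.approve(name, kwargs):
            self.log.append(("rejected", name))
            return {"status": "needs_human", "fallback": FALLBACK}
        result = fn(**kwargs)
        self.log.append(("ok", name))
        return {"status": "ok", "result": result}
```

Read-only diagnostics pass straight through, while anything that changes state waits for a human decision; the `log` entries are the kind of record that post-incident review could feed back into the tooling.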

Observability (enablement for automation)

Required Skills & Experience

Senior SRE experience on distributed systems and batch/intraday workloads in a production environment.
Strong Python
Provable agentic AI experience showing tool integration, guardrails, and an evaluation approach
Measurable impact (toil reduction, MTTR reduction, alert reduction, etc.)
Demonstrated process optimization ability (removing steps/handoffs, standardizing workflows, implementing lightweight controls with metrics)
Strong Linux and troubleshooting fundamentals across application/system/network layers
Experience working across mixed estates (on-prem VMs + Cloud, with some Kubernetes exposure for operational monitoring/reruns)

Differentiators

Exposure to Banking/Finance Market Risk Domains
Familiarity with the Athena ecosystem or similar (SecDB, Quartz)
