Site Reliability Engineer

Posted Today by Jobserve

Negotiable
London

Summary: The Site Reliability Engineer is responsible for the administration and operational management of the Zabbix monitoring platform, ensuring effective monitoring and alerting across enterprise infrastructure. This role includes providing Tier 1 support, configuring Zabbix components, and implementing monitoring for various systems and applications. The engineer will also focus on improving reliability through SRE practices and maintaining security controls. Additionally, the position supports 24x7 monitoring operations and production readiness activities.

Key Responsibilities:

  • Provide administration, support, and operational management of the Zabbix monitoring platform, ensuring reliable monitoring, alerting, and observability across enterprise infrastructure and services.
  • Provide Tier 1 support including user access management, alert triage, and incident response.
  • Configure and maintain Zabbix servers, proxies, templates, hosts, triggers, dashboards, discovery rules, and integrations.
  • Implement and support monitoring for servers, networks, applications, SNMP devices, syslog events, and service health metrics.
  • Support 24x7 monitoring operations, platform availability, patching, upgrades, and deployments.
  • Apply SRE practices to improve reliability, reduce alert noise, enhance monitoring quality, and support operational readiness.
  • Perform capacity planning, performance analysis, and monitoring platform optimization.
  • Maintain security controls including role-based access, credential management, audit compliance, and governance standards.
  • Support production readiness activities including failover testing, change management, documentation, and disaster recovery planning.

Key Skills:

  • Experience with Zabbix monitoring platform administration and configuration.
  • Knowledge of monitoring for servers, networks, applications, and SNMP devices.
  • Familiarity with SRE practices and operational readiness activities.
  • Strong troubleshooting and incident response skills.
  • Understanding of security controls and compliance standards.
  • Ability to perform capacity planning and performance analysis.
  • Experience in supporting 24x7 operations.

Salary (Rate): undetermined

City: London

Country: UK

Working Arrangements: undetermined

IR35 Status: undetermined

Seniority Level: undetermined

Industry: IT
