Negotiable
Telford, England, United Kingdom
Summary: The Observability Subject Matter Expert (SME) role for a Smart Edge client involves architecting, implementing, and managing observability frameworks within a hybrid-cloud environment, focusing primarily on AWS-native services and enterprise platforms. The position requires ensuring end-to-end visibility and performance optimization while leading incident management and root cause analysis. Collaborating with DevOps, Security, and Application teams to enforce observability standards and compliance is also a key part of the role. The ideal candidate will have extensive experience in observability engineering and AWS services.
Key Responsibilities:
- Design and implement observability pipelines using AWS-native and third-party tools.
- Define telemetry standards (metrics, logs, traces) across microservices, APIs, and data pipelines.
- Establish SLIs/SLOs and integrate them into service health dashboards.
- Implement observability for AWS Connect (contact flows, agent metrics, call quality).
- Monitor AWS Data Services (Glue, Redshift, Athena, S3, Lake Formation) for performance, throughput, and data lineage.
- Integrate AWS Integration Services (API Gateway, EventBridge, Step Functions, Lambda) with distributed tracing and structured logging.
- Deploy and manage observability tools: CloudWatch, X-Ray, OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, ELK.
- Automate alerting, anomaly detection, and incident correlation using AI/ML-based tools.
- Integrate observability into CI/CD pipelines and Infrastructure-as-Code (IaC) workflows.
- Lead real-time diagnostics during major incidents using telemetry data.
- Conduct post-incident reviews with detailed root cause analysis and observability insights.
- Work closely with DevOps, Security, and Application teams to enforce observability standards.
- Ensure compliance with data governance, retention, and security policies for telemetry data.
Key Skills:
- 7+ years in observability engineering.
- Deep expertise in AWS services, especially AWS Connect, Glue, Lambda, API Gateway, and S3, as well as infrastructure and networking.
- Strong hands-on experience with observability stacks such as Dynatrace, OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, ELK, CloudWatch/X-Ray.
- Proficient in scripting (Python, Bash) and IaC (Terraform, CloudFormation).
- Experience with monitoring enterprise platforms like Pega and Contact Center systems.
- Solid understanding of distributed systems, networking, and application performance tuning.
Salary (Rate): undetermined
City: Telford
Country: United Kingdom
Working Arrangements: undetermined
IR35 Status: undetermined
Seniority Level: undetermined
Industry: IT
A Smart Edge client is looking for an Observability Subject Matter Expert (SME) based in Telford, UK.
Job Description: The client needs an Observability Subject Matter Expert (SME) to architect, implement, and manage observability frameworks across a complex hybrid-cloud environment. This role will focus on AWS-native services (Connect, Data, Integration), enterprise platforms (Pega, Contact Center), and the underlying infrastructure, ensuring end-to-end visibility, performance optimization, and proactive incident response.
Key Responsibilities:
- Observability Architecture & Strategy:
  - Design and implement observability pipelines using AWS-native and third-party tools.
  - Define telemetry standards (metrics, logs, traces) across microservices, APIs, and data pipelines.
  - Establish SLIs/SLOs and integrate them into service health dashboards.
- AWS Workload Monitoring:
  - Implement observability for AWS Connect (contact flows, agent metrics, call quality).
  - Monitor AWS Data Services (Glue, Redshift, Athena, S3, Lake Formation) for performance, throughput, and data lineage.
  - Integrate AWS Integration Services (API Gateway, EventBridge, Step Functions, Lambda) with distributed tracing and structured logging.
- Tooling & Automation:
  - Deploy and manage observability tools: CloudWatch, X-Ray, OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, ELK.
  - Automate alerting, anomaly detection, and incident correlation using AI/ML-based tools.
  - Integrate observability into CI/CD pipelines and Infrastructure-as-Code (IaC) workflows.
- Incident Management & RCA:
  - Lead real-time diagnostics during major incidents using telemetry data.
  - Conduct post-incident reviews with detailed root cause analysis and observability insights.
- Collaboration & Governance:
  - Work closely with DevOps, Security, and Application teams to enforce observability standards.
  - Ensure compliance with data governance, retention, and security policies for telemetry data.
Key Skills:
- 7+ years in observability engineering.
- Deep expertise in AWS services, especially AWS Connect, Glue, Lambda, API Gateway, and S3, as well as infrastructure and networking.
- Strong hands-on experience with observability stacks such as Dynatrace, OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, ELK, and CloudWatch/X-Ray.
- Proficient in scripting (Python, Bash) and IaC (Terraform, CloudFormation).
- Experience with monitoring enterprise platforms like Pega and Contact Center systems.
- Solid understanding of distributed systems, networking, and application performance tuning.
If this sounds like a role you would be interested in, or if you know someone in this field, connect with me or email me at rajarathnam.k@smartedgesolutions.co.uk. Alternatively, you can call me on +442038131973.