Cloud Engineer - AWS Observability

Posted 5 days ago by Infoplus Technologies UK Limited

Negotiable
Telford, England, United Kingdom

Summary: The role of Cloud Engineer - AWS Observability involves architecting, implementing, and managing observability frameworks in a hybrid-cloud environment, primarily utilizing AWS-native services. The position requires ensuring end-to-end visibility and performance optimization while leading incident management and root cause analysis. The candidate will collaborate with various teams to enforce observability standards and compliance with data governance policies. A strong background in observability engineering and AWS services is essential for success in this role.

Salary (Rate): undetermined

City: Telford

Country: United Kingdom

Working Arrangements: undetermined

IR35 Status: undetermined

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

We are looking for a technically proficient Observability Subject Matter Expert (SME) to architect, implement, and manage observability frameworks across a complex hybrid-cloud environment. This role will focus on AWS-native services (Connect, Data, Integration), enterprise platforms (Pega, Contact Center), and the underlying infrastructure, ensuring end-to-end visibility, performance optimization, and proactive incident response.

Key Responsibilities:

  • Observability Architecture & Strategy:
      • Design and implement observability pipelines using AWS-native and third-party tools.
      • Define telemetry standards (metrics, logs, traces) across microservices, APIs, and data pipelines.
      • Establish SLIs/SLOs and integrate them into service health dashboards.
  • AWS Workload Monitoring:
      • Implement observability for AWS Connect (contact flows, agent metrics, call quality).
      • Monitor AWS Data Services (Glue, Redshift, Athena, S3, Lake Formation) for performance, throughput, and data lineage.
      • Integrate AWS Integration Services (API Gateway, EventBridge, Step Functions, Lambda) with distributed tracing and structured logging.
  • Tooling & Automation:
      • Deploy and manage observability tools: CloudWatch, X-Ray, OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, ELK.
      • Automate alerting, anomaly detection, and incident correlation using AI/ML-based tools.
      • Integrate observability into CI/CD pipelines and Infrastructure-as-Code (IaC) workflows.
  • Incident Management & RCA:
      • Lead real-time diagnostics during major incidents using telemetry data.
      • Conduct post-incident reviews with detailed root cause analysis and observability insights.
  • Collaboration & Governance:
      • Work closely with DevOps, Security, and Application teams to enforce observability standards.
      • Ensure compliance with data governance, retention, and security policies for telemetry data.

Required Skills & Experience:

  • 7+ years in observability engineering.
  • Deep expertise in AWS services, especially AWS Connect, Glue, Lambda, API Gateway, S3, Infrastructure and Network.
  • Strong hands-on experience with observability stacks such as Dynatrace, OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, ELK, CloudWatch/X-Ray.
  • Proficient in scripting (Python, Bash) and IaC (Terraform, CloudFormation).
  • Experience with monitoring enterprise platforms like Pega and Contact Center systems.
  • Solid understanding of distributed systems, networking, and application performance tuning.