Lead Observability Engineer Sumo Logic

Posted 1 week ago by 1751439378

Negotiable
Outside
Remote
USA

Summary: The Lead Observability Engineer role at VLink involves leading the implementation of Sumo Logic for clients transitioning from Dynatrace, focusing on cloud-native observability and Site Reliability Engineering (SRE) practices. The position requires extensive experience in Sumo Logic, Kubernetes observability, and the ability to design scalable monitoring solutions. This is a contract position with immediate start requirements, emphasizing collaboration and accountability within a remote work environment. The ideal candidate will drive service-level reliability and operational maturity in observability teams.

Key Responsibilities:

  • Lead the end-to-end implementation of the Sumo Logic observability platform for AWS and EKS environments.
  • Migrate monitoring and alerting assets from Dynatrace to Sumo Logic.
  • Define and implement SLIs/SLOs, error budgets, and reliability metrics for containerized services.
  • Deploy and configure Sumo Logic collectors across AWS and Kubernetes workloads (EKS).
  • Configure log, metric, and trace ingestion pipelines using OpenTelemetry and Sumo Logic apps.
  • Design and maintain dashboards for service health, performance, and reliability insights.
  • Implement intelligent alerting and notification workflows, using thresholds, baselines, and anomaly detection.
  • Collaborate with DevOps, SRE, and development teams to ensure complete tracing coverage across services.
  • Ensure best practices for alert noise reduction, escalation policies, and incident response are in place.
  • Contribute to observability runbooks, operational handover, and training for the client SRE team.

Key Skills:

  • Expert-level experience with Sumo Logic, including dashboarding, alerting, collector deployment, and ML features.
  • Strong background in Site Reliability Engineering (SRE), including SLIs/SLOs, error budgets, MTTR/MTTD metrics.
  • Proficiency in AWS services (especially CloudWatch, CloudTrail, Lambda, RDS) and EKS (Amazon Elastic Kubernetes Service).
  • Hands-on experience with OpenTelemetry for distributed tracing and service maps.
  • Strong understanding of Kubernetes metrics, pod health, container resource usage, and cluster monitoring.
  • Proven ability to define alert thresholds, configure notification routing (e.g. Slack, PagerDuty, ServiceNow), and manage alert fatigue.
  • Strong scripting and infrastructure-as-code experience with Terraform, Helm, YAML, and GitOps workflows.
  • Experience with incident triage, RCA documentation, and building operational maturity in observability teams.
  • Excellent communication and stakeholder engagement skills.

Salary (Rate): undetermined

City: undetermined

Country: USA

Working Arrangements: remote

IR35 Status: outside IR35

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

VLink is a leading global provider of software engineering services with next-gen technologies and best-in-class talent. With offices in 7+ countries, from North America and Europe to APAC, and expansion plans in the Middle East, VLink has helped SMBs and large enterprises achieve their business goals and has gained the trust of Fortune 250 companies. VLink is Great Place to Work Certified and has been a consistent winner of Best Places to Work in CT. Trust, collaboration, and accountability are the three elements at the core of VLink's work culture.

We value our professionals, providing comprehensive benefits and the opportunity for growth. This is a Contract position, and the client is looking for someone to start immediately.

Role: Lead Observability Engineer, Sumo Logic & SRE

Location: Remote

Hire type: 12+ months contract

Required skills: Sumo Logic & Cloud-native observability

JD:

Experience: 10+ years (with 3+ years in Sumo Logic & Cloud-native observability)

Job Summary:

We are seeking a highly skilled Lead Observability Engineer to lead a critical implementation of Sumo Logic for a client migrating from Dynatrace. This role requires deep expertise in Sumo Logic, Site Reliability Engineering (SRE) practices, and Kubernetes (EKS) observability. The ideal candidate will design and implement scalable dashboards, alerts, and tracing strategies, drive service-level reliability, and enable a steady-state SRE operations model.

Preferred Qualifications:

  • Sumo Logic certifications (Admin, Advanced Analytics) are a plus.
  • Experience with Dynatrace (for migration purposes).
  • Familiarity with integrating observability into CI/CD pipelines.
  • Exposure to service mesh (Istio/Linkerd) and monitoring microservices in that context.

Deliverables This Role Will Drive:

  • Sumo Logic observability reference architecture
  • EKS and AWS observability configuration
  • SLI/SLO documentation and tracking
  • Alerting and tracing setup across services
  • Production-ready dashboards and runbooks
  • Knowledge transfer and enablement sessions for SRE/DevOps teams