Negotiable
Summary: We are looking for a Lead Systems Engineer with extensive experience in Datadog, AWS, and ServiceNow integration to design, implement, and maintain monitoring and incident management solutions for cloud infrastructure. This role involves providing technical leadership, ensuring operational excellence, and collaborating with IT and engineering teams. The ideal candidate will have a strong background in cloud services and observability tools, along with the ability to mentor junior engineers. The position is remote and requires a proactive approach to system reliability and continuous improvement.
Key Responsibilities:
- Lead the architecture, design, and implementation of end-to-end monitoring solutions using Datadog.
- Oversee the deployment and management of AWS resources, ensuring adherence to best practices.
- Define monitoring strategies and best practices, including Datadog dashboards and custom metrics.
- Architect and manage the integration of Datadog with ServiceNow for incident management workflows.
- Provide technical leadership and mentorship to junior engineers.
- Collaborate with cross-functional teams to integrate monitoring into CI/CD pipelines.
- Drive continuous improvement in system reliability and anomaly detection.
- Contribute to Infrastructure as Code (IaC) standards using Terraform or similar tools.
- Participate in high-severity incident management and root cause analysis.
Key Requirements:
- Bachelor's degree in Computer Science, Information Technology, or related field.
- 5+ years of experience with AWS cloud services.
- 3+ years of hands-on experience with Datadog.
- 2+ years of experience integrating Datadog with ServiceNow.
- Experience leading teams or projects in a cloud operations or DevOps environment.
- Strong proficiency in scripting and automation (Python, Bash, etc.).
- Solid understanding of networking, security best practices, and troubleshooting cloud architectures.
Salary (Rate): undetermined
City: undetermined
Country: USA
Working Arrangements: remote
IR35 Status: outside IR35
Seniority Level: undetermined
Industry: IT
Position: Systems Engineer (Datadog, AWS & ServiceNow Integration)
Location: Washington, DC (REMOTE)
Strong Datadog experience required.
Job Summary
We are seeking a seasoned Lead Systems Engineer with deep expertise in Datadog, AWS, and ServiceNow integration. In this leadership role, you will oversee the design, implementation, and maintenance of comprehensive monitoring, observability, and incident management solutions for cloud-based infrastructure and applications. You will play a key role in guiding the team to ensure operational excellence, system reliability, and seamless collaboration across IT and engineering teams.
Responsibilities
- Lead the architecture, design, and implementation of end-to-end monitoring solutions using Datadog, ensuring high availability and performance of cloud-based services.
- Oversee the deployment and management of AWS resources (EC2, RDS, Lambda, ECS/EKS, S3, etc.), ensuring adherence to best practices for scalability, security, and cost optimization.
- Define monitoring strategies and best practices, including Datadog dashboards, monitors, alerts, and custom metrics for comprehensive observability.
- Architect and manage the integration of Datadog with ServiceNow to automate incident management workflows, event correlation, and CMDB synchronization.
- Provide technical leadership and mentorship to junior engineers on best practices for monitoring, logging, and observability.
- Collaborate with cross-functional teams to integrate monitoring and logging into CI/CD pipelines and cloud infrastructure.
- Drive continuous improvement in system reliability, including SLO/SLI definitions, synthetic monitoring, and anomaly detection.
- Contribute to and enforce Infrastructure as Code (IaC) standards using Terraform, CloudFormation, or similar tools.
- Participate in high-severity incident management, root cause analysis, and the implementation of corrective actions to prevent recurrence.
Requirements
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
- 5+ years of experience with AWS cloud services, including deployment, management, and optimization of cloud infrastructure.
- 3+ years of hands-on experience with Datadog, including complex dashboards, integrations, and custom metrics.
- 2+ years of experience integrating Datadog with ServiceNow, including incident management workflows, event management, and CMDB integration.
- Demonstrated experience leading teams or projects in a cloud operations or DevOps environment.
- Strong proficiency in scripting and automation (Python, Bash, or similar).
- Solid understanding of networking, security best practices, distributed systems, and troubleshooting complex cloud architectures.
Preferred Skills (Nice to Have)
- Experience with Infrastructure as Code (Terraform, CloudFormation).
- AWS certifications (e.g., AWS Certified Solutions Architect, DevOps Engineer).
- Experience with Kubernetes monitoring and log aggregation solutions (Fluentd, ELK stack).
- Familiarity with other observability tools like Prometheus or Grafana.
- ServiceNow certifications or experience with ServiceNow ITOM modules (Discovery, Event Management, CMDB).
- Excellent leadership and mentorship skills with experience in cross-functional collaboration.
Soft Skills
- Strong leadership and communication skills to effectively guide a team.
- Excellent problem-solving skills with the ability to handle high-pressure situations.
- Organizational and prioritization skills to manage multiple tasks and projects effectively.