Datacenter Observability and Site Reliability Engineer

Posted 1 week ago by 1750263405

Negotiable

Outside

Remote

USA

Summary: The Datacenter Observability and Site Reliability Engineer is responsible for ensuring the reliability and performance of datacenter infrastructure through observability solutions, incident management, and collaboration with engineering teams. This role involves implementing best practices in site reliability engineering, optimizing performance, and maintaining automation scripts. The engineer will also focus on security compliance and continuous improvement of services. The position requires extensive experience in datacenter observability and a strong technical skill set in relevant tools and technologies.

Key Responsibilities:

Design, implement, and maintain observability solutions for datacenter infrastructure.
Develop, deploy, and maintain operational and reliability components of a large-scale Observability and Telemetry collection platform.
Participate in and enhance the entire lifecycle of services, from inception and design to deployment, operation, and refinement.
Develop and optimize monitoring systems to ensure high availability and performance.
Create and manage dashboards, alerts, and reports to provide visibility into system health and performance.
Implement SRE best practices to improve the reliability, scalability, and performance of datacenter services.
Develop and maintain automation scripts for infrastructure provisioning, monitoring, and management.
Conduct root cause analysis and post-mortem reviews to prevent recurrence of incidents.
Analyze and optimize the performance of datacenter systems and applications.
Implement best practices for resource utilization and efficiency.
Work closely with other engineering teams to understand and meet their observability and reliability requirements.
Collaborate with hardware and software vendors to evaluate and integrate new technologies.
Ensure that observability and reliability solutions comply with security policies and industry standards.
Implement and maintain security measures to protect data and infrastructure.
Provide support for observability and reliability-related issues, including debugging and resolving hardware and software problems.
Develop and maintain documentation for troubleshooting procedures and best practices.
Stay updated with the latest advancements in observability and SRE technologies and integrate them into the infrastructure.
Continuously improve the reliability, scalability, and performance of datacenter services.

Key Skills:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
8+ years of experience in datacenter observability and site reliability engineering.
Proven experience in managing and optimizing large-scale datacenter environments.
Proficiency in observability tools and technologies (e.g., Prometheus, Grafana, ELK Stack).
Experience with SRE practices and tools (e.g., Kubernetes, Docker, Terraform).
Strong programming and scripting skills (e.g., Python, Go, Bash).
Familiarity with cloud platforms (AWS, Azure, Google Cloud Platform) and their observability and reliability services.
Strong problem-solving skills and attention to detail.
Excellent communication and collaboration skills.
Ability to work in a fast-paced, dynamic environment.

Salary (Rate): undetermined

City: undetermined

Country: USA

Working Arrangements: remote

IR35 Status: outside IR35

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

Skillset Description Summary: This team is responsible for the overall site reliability solution, including alerts, monitoring, and incident management for the hardware and Kubernetes infrastructure layers. The team includes L1, L2, and L3 support.

Roles and Responsibilities:

Observability and Monitoring-

Design, implement, and maintain observability solutions for datacenter infrastructure.
Develop, deploy, and maintain the operational and reliability components of a large-scale Observability and Telemetry collection platform, emphasizing performance at scale, real-time monitoring, logging, and alerting.
Participate in and enhance the entire lifecycle of services, from inception and design to deployment, operation, and refinement.
Develop and optimize monitoring systems to ensure high availability and performance.
Create and manage dashboards, alerts, and reports to provide visibility into system health and performance.

Site Reliability Engineering (SRE)-

Implement SRE best practices to improve the reliability, scalability, and performance of datacenter services.
Develop and maintain automation scripts for infrastructure provisioning, monitoring, and management.
Conduct root cause analysis and post-mortem reviews to prevent recurrence of incidents.

Performance Optimization-

Analyze and optimize the performance of datacenter systems and applications.
Implement best practices for resource utilization and efficiency.

Collaboration-

Work closely with other engineering teams to understand and meet their observability and reliability requirements.
Collaborate with hardware and software vendors to evaluate and integrate new technologies.

Security and Compliance-

Ensure that observability and reliability solutions comply with security policies and industry standards.
Implement and maintain security measures to protect data and infrastructure.

Troubleshooting and Support-

Provide support for observability and reliability-related issues, including debugging and resolving hardware and software problems.
Develop and maintain documentation for troubleshooting procedures and best practices.

Continuous Improvement-

Stay updated with the latest advancements in observability and SRE technologies and integrate them into the infrastructure.
Continuously improve the reliability, scalability, and performance of datacenter services.

Qualifications:

Education-

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

Experience-

8+ years of experience in datacenter observability and site reliability engineering.
Proven experience in managing and optimizing large-scale datacenter environments.

Technical Skills-

Proficiency in observability tools and technologies (e.g., Prometheus, Grafana, ELK Stack).
Experience with SRE practices and tools (e.g., Kubernetes, Docker, Terraform).
Strong programming and scripting skills (e.g., Python, Go, Bash).
Familiarity with cloud platforms (AWS, Azure, Google Cloud Platform) and their observability and reliability services.

Soft Skills-

Strong problem-solving skills and attention to detail.
Excellent communication and collaboration skills.
Ability to work in a fast-paced, dynamic environment.
