Senior Site Reliability Engineer

Summary: The Senior Site Reliability Engineer will enhance disaster recovery capabilities for Tier 1 applications, collaborating with application and infrastructure teams to establish standards and implement solutions. The role involves designing, deploying, and managing disaster recovery processes, ensuring observability, and leading testing exercises for continuous improvement. Candidates should have a strong technical background and experience with the tools and methodologies listed below. This is a hybrid contract position based in St. Louis, Missouri.

Key Responsibilities:

  • Partner with application and infrastructure teams to define Disaster Recovery (DR) standards
  • Design, deploy and manage Tier 1 DR capabilities
  • Standardize and evangelize DR implementation patterns
  • Define and evangelize observability and ops excellence standards as related to DR
  • Define and maintain failover criteria
  • Define, maintain and test Technical Recovery Guides (TRG)
  • Build, review and maintain application design and architecture documents
  • Ensure DR capabilities are built into each system
  • Work with development teams to implement and maintain DR capabilities
  • Participate in DR testing exercises and evaluate results for continuous improvement
  • Lead complex projects focused on observability/monitoring for applications
  • Make decisions around periodic system validation and testing
  • Identify strategies to increase system reliability and performance
  • Implement necessary manual and automated procedures for improved collaborative response
  • Lead lower-level engineers in stress, security, and performance testing
  • Resolve issues through support escalation
  • Keep documentation and runbooks up to date
  • Lead post-incident reviews and document findings
  • Review proposals to optimize Software Development Life Cycle (SDLC)
  • Communicate complex topics with development teams to investigate and document issues

Key Skills:

  • Bachelor's degree in a quantitative or business field (e.g., statistics, mathematics, engineering, computer science)
  • 4-6 years of experience in a relevant field
  • Experience with Rancher and Axway API Gateway
  • AWS (Route 53, Lambda), MongoDB, Kafka, Kubernetes
  • Load Balancing / Load Redirecting / Load Restricting strategies
  • Monitoring and Observability tools such as Prometheus, Grafana, Dynatrace, Splunk, ELK

Salary (Rate): $54 - $64 per hour

City: Saint Louis

Country: USA

Working Arrangements: hybrid

IR35 Status: outside IR35

Seniority Level: Senior

Industry: IT

Detailed Description From Employer:

description: job summary:

Story Behind the Need


Who is Resiliency Engineering Enablement?



  • Partner with application and infrastructure teams to define Disaster Recovery (DR) standards
  • Design, deploy and manage Tier 1 DR capabilities
  • Standardize and evangelize DR implementation patterns
  • Define and evangelize observability and ops excellence standards as related to DR
  • Define and maintain failover criteria
  • Define, maintain and test Technical Recovery Guides (TRG)


location: Saint Louis, Missouri

job type: Contract

salary: $54 - $64 per hour

work hours: 8am to 5pm

education: Bachelor's



responsibilities:

Typical Day in the Role



This resource will build and improve the disaster recovery (DR) capabilities of Client's Tier 1 applications. Common responsibilities include:

  • Building, reviewing and maintaining application design and architecture documents
  • Ensuring DR capabilities are built into each system
  • Working with development teams to implement and maintain the DR capabilities
  • Participating in DR testing exercises and evaluating the results for continuous improvement
  • Leading more complex projects focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility
  • Helping make decisions around periodic system validation and testing, service monitoring, and standing up new services/tools
  • Using knowledge and experience to identify strategies that increase system reliability and performance through on-call rotation and process optimization
  • Identifying and implementing necessary manual and automated procedures for improved real-time collaborative response
  • Leading lower-level engineers in stress, security, and performance testing
  • Resolving issues that come up through support escalation
  • Keeping documentation and runbooks up to date so that new incidents can be handled effectively
  • Leading post-incident reviews and documenting findings to inform future decisions
  • Reviewing proposals to optimize the Software Development Life Cycle (SDLC) to boost service reliability, and deciding which proposals should move forward
  • Communicating complex topics with development teams to investigate and document issues, and leading the internal team in developing solutions to mitigate them


qualifications:

Candidate Requirements



  • Required: A Bachelor's degree in a quantitative or business field (e.g., statistics, mathematics, engineering, computer science). Preferred:
  • Years of experience required: 4-6 years minimum
  • Disqualifiers: missing requirements
  • Additional qualities to look for: Experience with Rancher and Axway API Gateway


skills: Top 3 must-have hard skills stack-ranked by importance



  • 1. AWS (Route 53, Lambda), MongoDB, Kafka, Kubernetes
  • 2. Load Balancing / Load Redirecting / Load Restricting strategies
  • 3. Monitoring and Observability tools such as Prometheus, Grafana, Dynatrace, Splunk, ELK




Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.

At Randstad Digital, we welcome people of all abilities and want to ensure that our hiring and interview process meets the needs of all applicants. If you require a reasonable accommodation to make your application or interview experience a great one, please contact

Pay offered to a successful candidate will be based on several factors including the candidate's education, work experience, work location, specific job duties, certifications, etc. In addition, Randstad Digital offers a comprehensive benefits package, including: medical, prescription, dental, vision, AD&D, and life insurance offerings, short-term disability, and a 401(k) plan (all benefits are based on eligibility).

This posting is open for thirty (30) days.


It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.