Site Reliability Engineering (SRE) REMOTE (Contract)

Posted 1 week ago by 1750837489

Negotiable
Outside
Remote
USA

Summary: The role of Site Reliability Engineering (SRE) focuses on providing production support with an emphasis on observability and proactive issue identification. Candidates are expected to utilize various monitoring tools to ensure system performance and lead incident triage discussions with senior management. The position requires flexibility to work in a 24/7 environment and involves technical expertise in debugging across multiple layers of the tech stack. Strong communication skills are essential for effective collaboration and issue resolution.

Key Responsibilities:

  • Proactive issue identification using observability tools.
  • Utilizing monitoring and observability tools to track system performance.
  • Conducting production support activities and correlating inputs from various dashboards to drive resolution.
  • Identifying probable failure points through analysis of logs and observability dashboards.
  • Basic troubleshooting across the tech stack including Application, Database, Infra, and Network.
  • Leading and triaging identified issues/incidents in collaboration with VPs/SVPs.
  • Working in a 24 X 7 environment.
  • Analyzing issues using tools like Splunk, AppDynamics, Grafana, and others.
  • Debugging issues in VMs, Load balancers, Firewalls, API Gateways, and more.
  • Debugging in Containerization, Docker, Kubernetes, AWS, PCF, Azure.
  • Using APM, NMON, and Wireshark for issue analysis.
  • Setting up UEM and synthetic monitoring.

Key Skills:

  • Production support expertise with SRE Observability experience.
  • Excellent communication skills.
  • Flexibility to work in a 24 X 7 environment.
  • Technical expertise in tools like Splunk, AppDynamics, Grafana, etc.
  • Debugging skills across various tech stack layers.
  • Experience with Containerization, Docker, Kubernetes, AWS, PCF, Azure.
  • Knowledge of APM, NMON, and Wireshark.
  • Experience in UEM and synthetic monitoring setup.

Salary (Rate): undetermined

City: undetermined

Country: USA

Working Arrangements: remote

IR35 Status: outside IR35

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

Job Description:

Skills

  • Production support expertise with SRE Observability experience:
    • Proactive issue identification using observability tools.
    • Skill in using different monitoring and observability tools to track system performance.
    • Production support activities, including proactively identifying issues with observability tools and correlating inputs from various dashboards and tools to drive resolution.
    • Experience in swiftly identifying probable failure points by analyzing multiple inputs: logs, observability dashboards, recent application changes, infra and network changes, etc.
    • Basic troubleshooting on every layer of the tech stack: Application, Database, Infra (container platforms), and Network.
  • Communication: Excellent communicator. Candidates are also expected to actively lead the triage of proactively identified issues/incidents on calls where VPs/SVPs are also present.
  • Flexibility to work in a 24 X 7 environment.
  • Technical expertise:
    • Analysis of issues using Splunk (including Splunk APM and Splunk O11y), AppDynamics, Grafana, RedMetrics, and ThousandEyes.
    • Debugging of issues in VMs, load balancers, firewalls, API gateways, databases, networks, and Linux/Unix.
    • Debugging of issues in containerized environments: Docker, Kubernetes, AWS, PCF, Azure.
    • Analysis of issues using APM, NMON, and Wireshark.
    • Experience in UEM and synthetic monitoring setup.