Site Reliability Engineer (Strong Splunk, AppDynamics & Production Support) || Remote || Contract

Posted 5 days ago by 1752212631

Negotiable
Outside
Remote
USA

Summary: The Site Reliability Engineer role focuses on production support, with a strong emphasis on observability and proactive issue identification. The position requires extensive experience with monitoring tools and the ability to communicate effectively in high-stakes environments. The role is contract-based and remote, and supports a 24/7 operational environment. Candidates should have deep technical expertise across multiple systems and platforms, particularly in troubleshooting and debugging.

Key Responsibilities:

  • Proactive issue identification using observability tools.
  • Skills in using different monitoring & observability tools to track system performance.
  • Production support activities including proactive identification of issues leveraging observability tools.
  • Correlating inputs from various dashboards & tools to drive resolution.
  • Experience in swiftly identifying probable failure points through the analysis of multiple inputs.
  • Basic level of troubleshooting on every layer of the tech stack (Application, Database, Infra, and Network).
  • Excellent communication skills to lead and triage issues/incidents.
  • Flexibility to work in a 24x7 environment.
  • Analysis of issues via various monitoring tools.
  • Debugging of issues in VMs, Load balancers, Firewalls, API Gateways, DB, Network, Linux/Unix.
  • Debugging of issues in Containerization, Docker, Kubernetes, AWS, PCF, Azure.
  • Experience in UEM and synthetic monitoring setup.

Key Skills:

  • Production support expertise with SRE Observability experience.
  • Excellent communication skills.
  • Technical expertise in Splunk, AppDynamics, Grafana, RedMetrics, ThousandEyes.
  • Debugging skills in VMs, Load balancers, Firewalls, API Gateways, DB, Network, Linux/Unix.
  • Experience in Containerization, Docker, Kubernetes, AWS, PCF, Azure.
  • Experience in UEM and synthetic monitoring setup.
  • Optional skills in ServiceNow, Java, Python, AWS, Azure, Oracle, Cassandra, SQL Server, MySQL, and MongoDB.

Salary (Rate): undetermined

City: undetermined

Country: USA

Working Arrangements: remote

IR35 Status: outside IR35

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

Role: Site Reliability Engineer

Location: Remote

Exp Level: 15+ Years

Job type: Contract

Skills

  • Production support expertise with SRE Observability experience:
    • Proactive issue identification using observability tools.
    • Skills in using different monitoring & observability tools to track system performance.
    • Production support activities, including proactive identification of issues leveraging observability tools and correlating inputs from various dashboards & tools to drive resolution.
    • Experience in swiftly identifying probable failure points through the analysis of multiple inputs such as logs, observability dashboards, recent application changes, and infra and network changes.
    • Basic level of troubleshooting on every layer of the tech stack (Application, Database, Infra (Container platforms), and Network).
  • Communication: Excellent communicator, expected to actively lead and triage proactively identified issues/incidents on calls where VPs/SVPs are also present.
  • Flexibility to work in a 24x7 environment.
  • Technical expertise:
    • Analysis of issues via Splunk (including Splunk APM and Splunk O11y), AppDynamics, Grafana, RedMetrics, ThousandEyes.
    • Debugging of issues in VMs, Load balancers, Firewalls, API Gateways, DB, Network, Linux / Unix
    • Debugging of issues in Containerization, Docker, Kubernetes, AWS, PCF, Azure
    • Analysis of issues via APM, NMON, and Wireshark.
    • Experience in UEM and synthetic monitoring setup.
  • Optional skills:
    • ServiceNow (including AIOps, tools for Self-Heal and automated playbooks)
    • Development experience in some of the following technologies: Java, Python, AWS, Azure, Oracle, Cassandra, SQL Server, MySQL, and MongoDB.