
Site Reliability Engineer (Strong Splunk, AppDynamics & Production Support) || Remote || Contract
Posted 5 days ago by 1752212631
Negotiable
Outside
Remote
USA
Summary: The Site Reliability Engineer role focuses on providing production support with a strong emphasis on observability and proactive issue identification. The position requires extensive experience in monitoring tools and the ability to communicate effectively in high-stakes environments. The role is contract-based and allows for remote work, catering to a 24/7 operational environment. Candidates should possess a deep technical expertise in various systems and platforms, particularly in troubleshooting and debugging.
Key Responsibilities:
- Proactive issue identification using observability tools.
- Skills in using different monitoring & observability tools to track system performance.
- Production support activities including proactive identification of issues leveraging observability tools.
- Correlating inputs from various dashboards & tools to drive resolution.
- Experience in swiftly identifying probable failure points through the analysis of multiple inputs.
- Basic level of troubleshooting on every layer of the tech stack (Application, Database, Infra, and Network).
- Excellent communication skills to lead and triage issues/incidents.
- Flexibility to work in a 24 X 7 environment.
- Analysis of issues via various monitoring tools.
- Debugging of issues in VMs, Load balancers, Firewalls, API Gateways, DB, Network, Linux/Unix.
- Debugging of issues in Containerization, Docker, Kubernetes, AWS, PCF, Azure.
- Experience in UEM and synthetic monitoring setup.
Key Skills:
- Production support expertise with SRE Observability experience.
- Excellent communication skills.
- Technical expertise in Splunk, AppDynamics, Grafana, RedMetrics, 1000Eyes.
- Debugging skills in VMs, Load balancers, Firewalls, API Gateways, DB, Network, Linux/Unix.
- Experience in Containerization, Docker, Kubernetes, AWS, PCF, Azure.
- Experience in UEM and synthetic monitoring setup.
- Optional skills in ServiceNow, Java, Python, AWS, Azure, Oracle, Cassandra, SQL Server, MySQL, and MongoDB.
Salary (Rate): undetermined
City: undetermined
Country: USA
Working Arrangements: remote
IR35 Status: outside IR35
Seniority Level: undetermined
Industry: IT
Role: Site Reliability Engineer
Location: Remote
Exp Level: 15+ Years
Job type: Contract
Skills
- Production support expertise with SRE Observability experience:
  - Proactive issue identification using observability tools.
  - Skills in using different monitoring & observability tools to track system performance.
  - Production support activities, including proactive identification of issues leveraging observability tools and correlating inputs from various dashboards & tools to drive resolution.
  - Experience in swiftly identifying probable failure points through the analysis of multiple inputs: logs, observability dashboards, and recent application, infra, and network changes.
  - Basic level of troubleshooting on every layer of the tech stack (Application, Database, Infra (Container platforms), and Network).
- Communication: Excellent communicator, expected to actively lead and triage proactively identified issues/incidents on calls where VPs/SVPs are also present.
- Flexibility to work in a 24 X 7 environment.
- Technical expertise:
  - Analysis of issues via Splunk (including Splunk APM and Splunk O11y), AppDynamics, Grafana, RedMetrics, 1000Eyes.
  - Debugging of issues in VMs, Load balancers, Firewalls, API Gateways, DB, Network, Linux/Unix.
  - Debugging of issues in Containerization, Docker, Kubernetes, AWS, PCF, Azure.
  - Analysis of issues via APM, NMON, and Wireshark.
  - Experience in UEM and synthetic monitoring setup.
- Optional skills:
  - ServiceNow (including AIOps, tools for Self-Heal, and automated playbooks).
  - Development experience in some of these technologies: Java, Python, AWS, Azure, Oracle, Cassandra, SQL Server, MySQL, and MongoDB.