Negotiable | Outside IR35 | Remote | USA
Summary: The Cloud Site Reliability Engineer role focuses on enhancing infrastructure through SRE best practices in AWS and Azure environments. The position involves managing critical services, improving observability, and fostering automation to elevate the developer experience. The engineer will also take ownership of IAM governance and promote operational excellence while collaborating with developers and researchers. This role is remote and classified as outside IR35.
Key Responsibilities:
- Oversee the design and improvement of infrastructure using SRE best practices, including IaC, recovery automation, and systems that detect and resolve issues independently.
- Manage and fine-tune critical services across both cloud and on-prem environments: Kubernetes clusters, CI/CD pipelines, artifact registries, and custom workloads.
- Enhance observability through intelligent logging, metrics, tracing, and alerting, ensuring systems are transparent and actionable in real time.
- Champion automation by eliminating repetitive tasks, from deployment workflows to security audits, through scripting and tooling.
- Elevate the developer experience for 80+ engineers and researchers by streamlining secure, reliable workflows across hybrid and cloud-native platforms.
- Take ownership of IAM governance across platforms like Azure AD and AWS IAM. Implement lifecycle automation, auditing, and access controls.
- Foster a culture of operational excellence with strong practices around security, incident management, and resilience engineering.
- Act as a trusted partner to developers and researchers, enabling their speed and innovation without compromising stability.
Key Skills:
- Experience in Site Reliability Engineering, DevOps, or Systems Engineering within fast-paced, technically demanding environments.
- Strong background in Linux systems and cloud infrastructure, with hands-on experience in AWS (primary) and Azure environments.
- Solid command of Kubernetes and container orchestration in production environments.
- Expertise in Infrastructure as Code tools such as Ansible; building reproducible, scalable infrastructure is second nature to you.
- Deep experience in observability and incident response: you know how to set up effective monitoring, handle incidents, and lead blameless post-mortems.
- A security-first mindset, especially when it comes to protecting distributed systems and developer workflows.
- Proven ability to support and optimize CI/CD pipelines, container image builds, and artifact lifecycle management.
- Strong communication and collaboration skills. You build trust across teams and advocate for thoughtful, scalable solutions.
- Bonus if you've worked with event-driven architectures using technologies like Kafka.
Salary (Rate): undetermined
City: undetermined
Country: USA
Working Arrangements: remote
IR35 Status: outside IR35
Seniority Level: undetermined
Industry: IT
Cloud Site Reliability Engineer - AWS & Azure