Rate: Negotiable
IR35 Status: Inside IR35
Working Arrangements: Hybrid
Location: Cambridge, UK
Summary: The HPC Engineer role involves joining a team in the UK to build and operate high-performance computing capabilities in a hybrid working environment. The position requires collaboration with the scientific community to deliver HPC services, utilizing automation and DevOps practices for optimal performance. The ideal candidate will have extensive experience in large-scale computing environments and a strong understanding of relevant technologies. This role is contract-based and requires a hands-on approach to infrastructure and service management.
Key Responsibilities:
- Design, implement and maintain robust platform infrastructure using Infrastructure as Code (IaC) tools such as Terraform
- Develop, deliver and operate research computing services and applications
- Take a Site Reliability Engineering approach to HPC services, managing development, deployment, monitoring and incident response end-to-end
- Solve complex technical problems related to HPC services and user workflows
- Drive innovative computational solutions and exploit emerging technologies
- Administer large-scale cluster and server computing environments and related software (eg, Slurm, LSF, Grid Engine)
- Apply DevOps practices and agile methodologies for HPC operations
- Manage virtualized private cloud resources (eg, OpenStack)
- Implement and administer large-scale parallel filesystems (eg, Weka, GPFS, Lustre)
- Use configuration management tools (eg, Ansible, Salt, Puppet) for IT operations
- Develop scripts and tools for HPC and DevOps operations using Bash and Python
Key Skills:
- 10+ years of experience operating or engineering large-scale computing environments (HPC, HTC or BC)
- Strong understanding of Linux system administration, TCP/IP stack and storage subsystems
- Experience with high-speed networks (eg, InfiniBand)
- Proven experience with configuration management and automation frameworks
- Hands-on experience with DevOps processes and agile methodologies
- Ability to drive innovative computational solutions and exploit emerging technologies
- Experience in developing and managing relationships with third-party suppliers
- Scientific degree and/or experience in computationally intensive scientific data analysis
- Previous experience in large-scale HPC environments (>10,000 cores)
Salary (Rate): undetermined
City: Cambridge
Country: UK
Working Arrangements: hybrid
IR35 Status: inside IR35
Seniority Level: undetermined
Industry: IT
Detailed Description From Employer:
HPC Engineer - Contract via Umbrella - Cambridge/Hybrid
Location: Cambridge, hybrid (ideally 3 days onsite)
Market rate
Description
We're looking for an HPC Engineer to join our team in the United Kingdom on a hybrid working basis (ideally 3 days onsite). In this role, you will help build and operate industry-leading high-performance computing (HPC) capabilities, including application build frameworks, containerized applications and cloud-based services. You will work closely with the scientific community to deliver high-quality HPC services, using automation, infrastructure-as-code and DevOps practices to ensure scalability, reliability and performance in a rapidly evolving HPC landscape.
Additional
- Experience with public cloud infrastructure (AWS, Azure, GCP)
- Experience managing virtualized private cloud environments (eg, OpenStack)
- Familiarity with container technologies (LXD, Singularity, Docker, Kubernetes)
- Development experience with programming languages and tools (Java/C++, Python/Ruby/Perl, SQL)
- Experience with HashiCorp tools (Terraform, Vault, Consul, Nomad)