Linux Administrator (HPC and NVIDIA GPU) | 100% Remote | (#CH) (#ANB)

Linux Administrator (HPC and NVIDIA GPU) | 100% Remote | (#CH) (#ANB)

Posted 1 day ago by 1763119701

Negotiable
Outside
Remote
USA

Summary: The role of System Administrator for High-Performance Computing (HPC) focuses on providing operational support and troubleshooting for NVIDIA GPU systems in a remote setting. The position requires managing workload scheduling, software updates, and documentation while ensuring system performance and availability. Candidates should have a strong background in system administration, particularly in HPC or GPU environments, along with relevant technical skills. This role emphasizes knowledge transfer and training for team members and users.

Key Responsibilities:

  • Provide operational support and problem resolution for the NVIDIA NVL72 GPU system.
  • Monitor system health and performance, proactively identifying and resolving issues to maintain high uptime and availability.
  • Administer and optimize the Slurm workload manager for efficient job scheduling and resource allocation.
  • Manage container orchestration using Kubernetes (K8s) within the HPC environment.
  • Maintain and update NVIDIA software stacks, ensuring proper patch management, version control, and security compliance.
  • Utilize NVIDIA Base Command Manager for system orchestration, monitoring, and optimization.
  • Author and maintain detailed technical documentation, including system architecture, configurations, and operational procedures.
  • Create clear, user-friendly How To guides to support onboarding and self-service among researchers and staff.
  • Conduct on-the-job training sessions for new team members and end users to facilitate knowledge transfer and best practices.

Key Skills:

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
  • 3-5 years of experience in system administration, preferably in HPC or GPU-accelerated environments.
  • Proficiency in Linux, Slurm, Kubernetes, and NVIDIA GPU technologies.
  • Demonstrated experience writing technical documentation and user guides.

Salary (Rate): undetermined

City: undetermined

Country: USA

Working Arrangements: remote

IR35 Status: outside IR35

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:
System Administrator High-Performance Computing (HPC)

Location: Remote

DURATION: 6 to 12 Months

Key Responsibilities

System Support & Troubleshooting

  • Provide operational support and problem resolution for the NVIDIA NVL72 GPU system.
  • Monitor system health and performance, proactively identifying and resolving issues to maintain high uptime and availability. Cluster & Workload Management
  • Administer and optimize the Slurm workload manager for efficient job scheduling and resource allocation.
  • Manage container orchestration using Kubernetes (K8s) within the HPC environment. Software & Patch Management
  • Maintain and update NVIDIA software stacks, ensuring proper patch management, version control, and security compliance.
  • Utilize NVIDIA Base Command Manager for system orchestration, monitoring, and optimization. Documentation & Knowledge Transfer
  • Author and maintain detailed technical documentation, including system architecture, configurations, and operational procedures.
  • Create clear, user-friendly How To guides to support onboarding and self-service among researchers and staff.
  • Conduct on-the-job training sessions for new team members and end users to facilitate knowledge transfer and best practices. Qualifications

Required

  • Bachelor s degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
  • 3 5 years of experience in system administration, preferably in HPC or GPU-accelerated environments.
  • Proficiency in Linux, Slurm, Kubernetes, and NVIDIA GPU technologies.
  • Demonstrated experience writing technical documentation and user