Linux Administrator (HPC and NVIDIA GPU) | 100% Remote | (#CH) (#ANB)
Posted 1 day ago by 1763119701
Negotiable
Outside
Remote
USA
Summary: The role of System Administrator for High-Performance Computing (HPC) focuses on providing operational support and troubleshooting for NVIDIA GPU systems in a remote setting. The position requires managing workload scheduling, software updates, and documentation while ensuring system performance and availability. Candidates should have a strong background in system administration, particularly in HPC or GPU environments, along with relevant technical skills. This role emphasizes knowledge transfer and training for team members and users.
Key Responsibilities:
- Provide operational support and problem resolution for the NVIDIA NVL72 GPU system.
- Monitor system health and performance, proactively identifying and resolving issues to maintain high uptime and availability.
- Administer and optimize the Slurm workload manager for efficient job scheduling and resource allocation.
- Manage container orchestration using Kubernetes (K8s) within the HPC environment.
- Maintain and update NVIDIA software stacks, ensuring proper patch management, version control, and security compliance.
- Utilize NVIDIA Base Command Manager for system orchestration, monitoring, and optimization.
- Author and maintain detailed technical documentation, including system architecture, configurations, and operational procedures.
- Create clear, user-friendly How To guides to support onboarding and self-service among researchers and staff.
- Conduct on-the-job training sessions for new team members and end users to facilitate knowledge transfer and best practices.
Key Skills:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
- 3-5 years of experience in system administration, preferably in HPC or GPU-accelerated environments.
- Proficiency in Linux, Slurm, Kubernetes, and NVIDIA GPU technologies.
- Demonstrated experience writing technical documentation and user guides.
Salary (Rate): undetermined
City: undetermined
Country: USA
Working Arrangements: remote
IR35 Status: outside IR35
Seniority Level: undetermined
Industry: IT
Location: Remote
DURATION: 6 to 12 Months
Key Responsibilities
System Support & Troubleshooting
- Provide operational support and problem resolution for the NVIDIA NVL72 GPU system.
- Monitor system health and performance, proactively identifying and resolving issues to maintain high uptime and availability. Cluster & Workload Management
- Administer and optimize the Slurm workload manager for efficient job scheduling and resource allocation.
- Manage container orchestration using Kubernetes (K8s) within the HPC environment. Software & Patch Management
- Maintain and update NVIDIA software stacks, ensuring proper patch management, version control, and security compliance.
- Utilize NVIDIA Base Command Manager for system orchestration, monitoring, and optimization. Documentation & Knowledge Transfer
- Author and maintain detailed technical documentation, including system architecture, configurations, and operational procedures.
- Create clear, user-friendly How To guides to support onboarding and self-service among researchers and staff.
- Conduct on-the-job training sessions for new team members and end users to facilitate knowledge transfer and best practices. Qualifications
Required
- Bachelor s degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
- 3 5 years of experience in system administration, preferably in HPC or GPU-accelerated environments.
- Proficiency in Linux, Slurm, Kubernetes, and NVIDIA GPU technologies.
- Demonstrated experience writing technical documentation and user