Summary: The Senior Software Engineer for AI Research Clusters will be responsible for designing, building, and optimizing large-scale AI research infrastructure. This role emphasizes expertise in distributed systems and high-performance computing to support AI/ML workloads. The engineer will collaborate closely with AI researchers to enhance system efficiency and reliability. The position is fully remote and requires a strong technical background in relevant programming and infrastructure management.
Salary (Rate): £56.00 hourly
City: undetermined
Country: undetermined
Working Arrangements: remote
IR35 Status: undetermined
Seniority Level: undetermined
Industry: IT
Job Title: Senior Software Engineer – AI Research Clusters
Location: Remote
Employment Type: Full-time
Role Overview
We are seeking a Senior Software Engineer to design, build, and optimize large-scale AI research clusters. This role focuses on distributed systems, high-performance computing, and infrastructure that supports AI/ML workloads such as model training and experimentation.
Key Responsibilities
- Design and manage scalable AI research infrastructure and compute clusters.
- Build and optimize distributed systems for large-scale model training and data processing.
- Develop tools and frameworks to support researchers and ML engineers.
- Work closely with AI researchers to understand workload requirements and improve system efficiency.
- Optimize GPU/CPU utilization, storage, and networking performance.
- Implement scheduling, resource allocation, and workload orchestration systems.
- Ensure system reliability, monitoring, and fault tolerance.
- Automate infrastructure provisioning using Infrastructure as Code (IaC).
- Troubleshoot performance bottlenecks and system failures.
Required Qualifications
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Strong programming skills in Python, Go, C++, or similar.
- Experience with distributed systems and parallel computing.
- Hands-on experience with containerization and orchestration tools (Docker, Kubernetes).
- Familiarity with cloud platforms (AWS, Azure, or Google Cloud Platform) or on-prem HPC clusters.
- Understanding of networking, storage systems, and system performance tuning.
Preferred Skills
- Experience with ML frameworks (TensorFlow, PyTorch).
- Familiarity with GPU computing (CUDA, NCCL).
- Knowledge of cluster schedulers (Slurm, Kubernetes schedulers).
- Experience with big data tools (Spark, Ray).
- Exposure to MLOps and experiment tracking tools.
Key Competencies
- Strong problem-solving and systems thinking.
- Collaboration with research and engineering teams.
- Performance optimization mindset.
- Ownership and accountability.