Senior Software Engineer – AI Research Clusters/Remote

Posted 1 week ago by Apetan Consulting

Negotiable
Undetermined
Remote

Summary: The Senior Software Engineer for AI Research Clusters will design, build, and optimize large-scale AI research infrastructure. The role calls for expertise in distributed systems and high-performance computing to support AI/ML workloads. The engineer will collaborate closely with AI researchers to improve system efficiency and reliability. The position is fully remote and requires a strong technical background in programming and infrastructure management.

Key Responsibilities:

  • Design and manage scalable AI research infrastructure and compute clusters.
  • Build and optimize distributed systems for large-scale model training and data processing.
  • Develop tools and frameworks to support researchers and ML engineers.
  • Work closely with AI researchers to understand workload requirements and improve system efficiency.
  • Optimize GPU/CPU utilization, storage, and networking performance.
  • Implement scheduling, resource allocation, and workload orchestration systems.
  • Ensure system reliability, monitoring, and fault tolerance.
  • Automate infrastructure provisioning using Infrastructure as Code (IaC).
  • Troubleshoot performance bottlenecks and system failures.

Key Skills:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • Strong programming skills in Python, Go, C++, or similar.
  • Experience with distributed systems and parallel computing.
  • Hands-on experience with containerization and orchestration tools (Docker, Kubernetes).
  • Familiarity with cloud platforms (AWS, Azure, or Google Cloud Platform) or on-prem HPC clusters.
  • Understanding of networking, storage systems, and system performance tuning.

Salary (Rate): £56.00 hourly

City: undetermined

Country: undetermined

Working Arrangements: remote

IR35 Status: undetermined

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

Job Title: Senior Software Engineer – AI Research Clusters

Location: Remote

Employment Type: Full-time


Role Overview

We are seeking a Senior Software Engineer to design, build, and optimize large-scale AI research clusters. This role focuses on distributed systems, high-performance computing, and infrastructure that supports AI/ML workloads such as model training and experimentation.


Preferred Skills

  • Experience with ML frameworks (TensorFlow, PyTorch).
  • Familiarity with GPU computing (CUDA, NCCL).
  • Knowledge of cluster schedulers (Slurm, Kubernetes schedulers).
  • Experience with big data tools (Spark, Ray).
  • Exposure to MLOps and experiment tracking tools.

Key Competencies

  • Strong problem-solving and systems thinking
  • Collaboration with research and engineering teams
  • Performance optimization mindset
  • Ownership and accountability