Senior Software Engineer – AI Research Clusters/Remote

Posted 2 weeks ago by Apetan Consulting

Negotiable

Undetermined

Remote

Artificial Intelligence, Azure Kubernetes Service, Big Data, C++ (Programming Language), Cloud Computing, Cloud Technology, Computer Science, Containerisation, Data Processing, Device Tracking Software, Generic Programming, Google Cloud, Google Cloud Platform (GCP), Graphics Processing Unit (GPU), Infrastructure as Code (IaC), Kubernetes, Machine Learning, Management, Microsoft Azure, Parallel Computing, Performance Tuning, Python (Programming Language), PyTorch (Machine Learning Library), Resource Allocation, Scheduling, Software Development, Software Engineering, Storage Systems

Summary: The Senior Software Engineer for AI Research Clusters will be responsible for designing, building, and optimizing large-scale AI research infrastructure. This role emphasizes expertise in distributed systems and high-performance computing to support AI/ML workloads. The engineer will collaborate closely with AI researchers to enhance system efficiency and reliability. The position is fully remote and requires a strong technical background in relevant programming and infrastructure management.

Salary (Rate): £56.00 hourly

City: undetermined

Country: undetermined

Working Arrangements: remote

IR35 Status: undetermined

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

Job Title: Senior Software Engineer – AI Research Clusters

Location: Remote

Employment Type: Full-time

Role Overview

We are seeking a Senior Software Engineer to design, build, and optimize large-scale AI research clusters. This role focuses on distributed systems, high-performance computing, and infrastructure that supports AI/ML workloads such as model training and experimentation.

Key Responsibilities

Design and manage scalable AI research infrastructure and compute clusters.
Build and optimize distributed systems for large-scale model training and data processing.
Develop tools and frameworks to support researchers and ML engineers.
Work closely with AI researchers to understand workload requirements and improve system efficiency.
Optimize GPU/CPU utilization, storage, and networking performance.
Implement scheduling, resource allocation, and workload orchestration systems.
Ensure system reliability, monitoring, and fault tolerance.
Automate infrastructure provisioning using Infrastructure as Code (IaC).
Troubleshoot performance bottlenecks and system failures.

Required Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Strong programming skills in Python, Go, C++, or similar.
Experience with distributed systems and parallel computing.
Hands-on experience with containerization and orchestration tools (Docker, Kubernetes).
Familiarity with cloud platforms (AWS, Azure, or Google Cloud Platform) or on-prem HPC clusters.
Understanding of networking, storage systems, and system performance tuning.

Preferred Skills

Experience with ML frameworks (TensorFlow, PyTorch).
Familiarity with GPU computing (CUDA, NCCL).
Knowledge of cluster schedulers (Slurm, Kubernetes schedulers).
Experience with big data tools (Spark, Ray).
Exposure to MLOps and experiment tracking tools.

Key Competencies

Strong problem-solving and systems thinking
Collaboration with research and engineering teams
Performance optimization mindset
Ownership and accountability
