AI Site Reliability Engineer

Posted Today by 1756891593

Negotiable

Outside

Remote

USA

Summary: The AI Site Reliability Engineer will be responsible for building, developing, and expanding artificial intelligence platforms within the IT Infrastructure Services organization. This role focuses on using SRE mechanisms to enhance operational capabilities and maintain Service Level Objectives for NVIDIA DGX and Cisco UCS-based AI platforms. The engineer will lead the automation of pipelines through CI/CD systems to improve service delivery. The position requires a strong background in high-performance computing and software engineering to solve operational challenges effectively.

Key Responsibilities:

Apply technical knowledge of high-performance compute, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System (UCS).
Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by instilling engineering reliability into the development life cycle, with a focus on fault-tolerant approaches.
Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
Deliver automation through CI/CD pipelines, chatbots, etc.
Implement metrics-driven processes to ensure service quality targets are met.

Key Skills:

HPC Clusters (NVIDIA DGX)
Python or Go (intermediate)
Kubernetes/OpenShift (Deployment)
Automation (Ansible & Terraform)
Linux
DevOps (GitHub, CI/CD)

Salary (Rate): undetermined

City: undetermined

Country: USA

Working Arrangements: remote

IR35 Status: outside IR35

Seniority Level: undetermined

Industry: IT

Detailed Description From Employer:

Hi,

Job Title: AI Site Reliability Engineer

Location: Remote

Duration: Long term Contract

Job Description:

This role involves building, developing, and expanding our artificial intelligence platforms, which will empower the business to fundamentally change the world. You will be an AI Site Reliability Engineer in the IT Infrastructure Services organization. You will use SRE mechanisms to reduce toil and maintain Service Level Objectives (SLOs) for our internal NVIDIA DGX and Cisco UCS-based AI platforms. You will lead, build, and run fully automated pipelines through our Continuous Integration/Continuous Delivery (CI/CD) system to deliver operational capabilities and improvements.

Mandatory Skills

HPC Clusters (NVIDIA DGX)

Python or Go (intermediate)

Kubernetes/OpenShift (Deployment)

Automation (Ansible & Terraform)

Linux

DevOps (GitHub, CI/CD)

Responsibilities include

Apply technical knowledge of high-performance compute, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System (UCS).
Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by instilling engineering reliability into the development life cycle, with a focus on fault-tolerant approaches.
Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
Deliver automation through CI/CD pipelines, chatbots, etc.
Implement metrics-driven processes to ensure service quality targets are met.

Who You Are

You are an experienced Site Reliability Engineer for high-performance compute, artificial intelligence, machine learning, and/or integrated computer systems. You take a software engineering approach to solving operational problems. You know HPC and are familiar with Kubernetes. You have experience delivering software solutions and working with Linux operating systems. You understand IT infrastructure customers and are passionate about diving deep into problems and fixing them.

Our Minimum Requirements include:

Bachelor's degree in Computer Science, Information Technology, or a related field; or equivalent years of experience in information technology.
Experience deploying and administering NVIDIA (DGX) or equivalent high-performance compute (HPC) clusters (e.g., Cray, HPE, IBM).
5+ years administering and supporting Linux-based operating systems.
Experience writing code in general-purpose programming languages such as Python, Go, and C/C++, and using Git and CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins).
Experience deploying enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos.
Sophisticated knowledge of Kubernetes, Docker, Terraform, Ansible, Jenkins, GitOps, Git, and Linux.
Experience across the software development lifecycle, including design, development, testing, packaging, and deployment, using Python or Go.

Preferred Qualifications

Master's degree or equivalent experience in a relevant field.
Certifications in Linux, Networking, Cloud, or related technologies.
Prior successful experience as a compute or site/systems reliability engineer.
Experience with Kubernetes, Hybrid Cloud, Virtualization, and Container technologies.
Experience with Agile and DevOps operating models, including project tracking tools (e.g., Jira, Rally).
Excellent collaborator who can partner, lead, guide, and communicate advanced technical concepts.
