Distributed Systems Framework Engineer

Posted 2 days ago by 1765260119

Negotiable
Outside
Remote
USA

Summary: Our client is looking for a Software Engineer to develop a framework for managing jobs across on-prem and cloud environments. The role involves designing orchestration logic for loading LLMs, processing queries, and ensuring system reliability. Ideal candidates will have 2-3 years of experience with scalable distributed systems. This is a part-time contract position with a focus on collaboration and clean coding practices.

Key Responsibilities:

  • Build a framework to manage jobs across on-prem and cloud compute.
  • Implement job orchestration to allocate compute nodes, load LLMs, process queries, and deliver results.
  • Design fault-tolerant execution with restart/recovery mechanisms.
  • Ensure clean shutdown of child nodes and processes.
  • Work with AWS/Google Cloud Platform for compute, storage, and workflow integrations.
  • Manage containers and scheduling in Kubernetes.
  • Write clean, testable code with unit tests.
  • Collaborate with engineering teams on architecture and reviews.
  • Use Git for branching, PRs, reviews, and merges.

Key Skills:

  • 2-3 years of software engineering experience.
  • Proficiency in Python.
  • Experience with LLM inference libraries (vLLM, Transformers, or Nemotron).
  • Experience with Kubernetes and distributed container orchestration.
  • Experience building robust distributed applications with graceful recovery.
  • Experience with AWS or Google Cloud Platform.
  • Experience writing unit tests.
  • Strong collaboration and communication skills.
  • PyTorch.
  • API design experience.

Salary (Rate): $41/hr - $56/hr DOE

City: undetermined

Country: USA

Working Arrangements: remote

IR35 Status: outside IR35

Seniority Level: Mid-Level

Industry: IT

Detailed Description From Employer:

Our client is seeking a Software Engineer to build a robust framework that schedules and manages jobs across on-prem and cloud compute environments. You'll design orchestration logic that loads LLMs onto compute nodes, retrieves queries from storage, processes inference, and ensures graceful shutdown and recovery. This is a great role for an engineer with 2-3 years of experience who is excited to work on scalable distributed systems.

Responsibilities

  • Build a framework to manage jobs across on-prem and cloud compute.
  • Implement job orchestration to allocate compute nodes, load LLMs, process queries, and deliver results.
  • Design fault-tolerant execution with restart/recovery mechanisms.
  • Ensure clean shutdown of child nodes and processes.
  • Work with AWS/Google Cloud Platform for compute, storage, and workflow integrations.
  • Manage containers and scheduling in Kubernetes.
  • Write clean, testable code with unit tests.
  • Collaborate with engineering teams on architecture and reviews.
  • Use Git for branching, PRs, reviews, and merges.

Requirements

  • 2-3 years of software engineering experience.
  • Proficiency in Python.
  • Experience with LLM inference libraries (vLLM, transformers, or nemotron).
  • Experience with Kubernetes and distributed container orchestration.
  • Experience building robust distributed applications with graceful recovery.
  • Experience with AWS or Google Cloud Platform.
  • Experience writing unit tests.
  • Strong collaboration and communication skills.
  • PyTorch.
  • API design experience.

Type: Contract - Part Time (20hrs/week)

Duration: 9 months with extension

Location: Remote (U.S.)

Salary Range: $41/hr - $56/hr DOE

No third-party agencies or C2C.