Summary: Our client is looking for a Software Engineer to develop a framework for managing jobs across on-prem and cloud environments. The role involves designing orchestration logic for loading LLM models, processing queries, and ensuring system reliability. Ideal candidates will have 2-3 years of experience in scalable distributed systems. This is a part-time contract position with a focus on collaboration and clean coding practices.
Key Responsibilities:
- Build a framework to manage jobs across on-prem and cloud compute.
- Implement job orchestration to allocate compute nodes, load LLMs, process queries, and deliver results.
- Design fault-tolerant execution with restart/recovery mechanisms.
- Ensure clean shutdown of child nodes and processes.
- Work with AWS/Google Cloud Platform for compute, storage, and workflow integrations.
- Manage containers and scheduling in Kubernetes.
- Write clean, testable code with unit tests.
- Collaborate with engineering teams on architecture and reviews.
- Use Git for branching, PRs, reviews, and merges.
Key Skills:
- 2-3 years of software engineering experience.
- Proficiency in Python.
- Experience with LLM inference libraries (vLLM, transformers, or nemotron).
- Experience with Kubernetes and distributed container orchestration.
- Experience building robust distributed applications with graceful recovery.
- Experience with AWS or Google Cloud Platform.
- Experience writing unit tests.
- Strong collaboration and communication skills.
- PyTorch.
- API design experience.
Salary (Rate): $41/hr - $56/hr
City: undetermined
Country: USA
Working Arrangements: remote
IR35 Status: outside IR35
Seniority Level: Mid-Level
Industry: IT
Our client is seeking a Software Engineer to build a robust framework that schedules and manages jobs across on-prem and cloud compute environments. You'll design orchestration logic that loads LLMs onto compute nodes, retrieves queries from storage, runs inference, and ensures graceful shutdown and recovery. This is a great role for an engineer with 2-3 years of experience who is excited to work on scalable distributed systems.
Responsibilities
- Build a framework to manage jobs across on-prem and cloud compute.
- Implement job orchestration to allocate compute nodes, load LLMs, process queries, and deliver results.
- Design fault-tolerant execution with restart/recovery mechanisms.
- Ensure clean shutdown of child nodes and processes.
- Work with AWS/Google Cloud Platform for compute, storage, and workflow integrations.
- Manage containers and scheduling in Kubernetes.
- Write clean, testable code with unit tests.
- Collaborate with engineering teams on architecture and reviews.
- Use Git for branching, PRs, reviews, and merges.
Requirements
- 2-3 years of software engineering experience.
- Proficiency in Python.
- Experience with LLM inference libraries (vLLM, transformers, or nemotron).
- Experience with Kubernetes and distributed container orchestration.
- Experience building robust distributed applications with graceful recovery.
- Experience with AWS or Google Cloud Platform.
- Experience writing unit tests.
- Strong collaboration and communication skills.
- PyTorch.
- API design experience.
Type: Contract - Part Time (20hrs/week)
Duration: 9 months with extension
Location: Remote (U.S.)
Salary Range: $41/hr - $56/hr DOE
No 3rd party agencies or C2C