Senior Python Developer (AI / LLM / Open Source)

Posted Today by Highbrow Technology Inc

Negotiable
Undetermined
Remote
EMEA

Summary: The SwarmBench Task Engineer builds high-quality coding benchmark tasks for evaluating AI agents, working with complex open-source codebases and designing multi-agent workflows. The contract is short-term (four weeks) and fully remote, with specific availability requirements; an immediate start is expected of qualified candidates.

Industry: IT

Detailed Description From Employer:

We’re Hiring: SwarmBench Task Engineer — SWE / Code (Contract)

Location: Remote (India, Bangladesh, Brazil, Colombia, Egypt, Ghana, Indonesia, Kenya, Nigeria, Turkey, Vietnam)

Duration: 4 Weeks (Short-term Contract)

Engagement: Contractor

Availability: 8 hrs/day (4 hrs of overlap with PST)

Start: Immediate

Role Overview: We’re looking for experienced engineers to build high-quality, real-world coding benchmark tasks used to evaluate AI agents. You’ll work on complex open-source codebases, design multi-agent workflows, and create structured evaluation tasks.

What You’ll Do:

  • Build benchmark tasks from real-world code changes (bug fixes, refactors, migrations)
  • Work with Docker environments & the Harbor framework
  • Write precise task instructions and verification scripts
  • Break down complex engineering problems into multi-agent workflows
  • Debug, test, and refine tasks for accuracy and reproducibility
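As an illustration of the verification-script work above, here is a minimal Python sketch of a pass/fail checker. All names here (`verify_task`, the command shape) are hypothetical; the posting does not specify the Harbor framework's actual interface, and a real harness would run something like this inside the task's Docker environment after the agent's patch is applied.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    detail: str


def verify_task(test_cmd, cwd=".", timeout=300):
    """Run a task's verification command (e.g. a pytest selection)
    and reduce the result to a reproducible pass/fail verdict.

    `test_cmd` is illustrative: a real harness would take it from
    the task's configuration rather than a hard-coded argument.
    """
    try:
        proc = subprocess.run(
            test_cmd,
            cwd=cwd,
            timeout=timeout,
            capture_output=True,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return Verdict(False, "verification timed out")
    if proc.returncode == 0:
        return Verdict(True, "all checks passed")
    # Keep the output so task authors can debug flaky or wrong checks.
    return Verdict(False, proc.stdout + proc.stderr)
```

A harness might call `verify_task(["pytest", "-q", "tests/test_fix.py"], cwd="/repo")` once the agent's changes are in place, treating a zero exit code as success.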

Requirements:

  • 5+ years in Python & JavaScript
  • Experience with AI coding benchmarks (SWE-bench, Terminal-Bench)
  • Strong exposure to large open-source codebases (Django, Flask, FastAPI, Node.js)
  • Solid Git workflow knowledge (PRs, diffs, commits)
  • Hands-on Docker experience
  • Strong problem-solving & technical documentation skills
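To make the Git-workflow requirement concrete: benchmarks built from real commits (SWE-bench style) typically separate a change's solution diff from its held-out test diff. The path-prefix heuristic below is an assumption for illustration, not a stated SwarmBench rule, and the parsing is simplified (it assumes paths without spaces).

```python
def split_patch(unified_diff: str, test_prefixes=("tests/", "test_")):
    """Split a unified diff into (solution, tests) portions by file path.

    The prefix heuristic is illustrative; real benchmarks such as
    SWE-bench define their own held-out test selection.
    """
    solution, tests = [], []
    current = None
    for line in unified_diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # The path appears as "a/<path>"; strip the "a/" prefix.
            path = line.split()[2][2:]
            is_test = (
                path.startswith(test_prefixes)
                or path.split("/")[-1].startswith("test_")
            )
            current = tests if is_test else solution
        if current is not None:
            current.append(line)
    return "".join(solution), "".join(tests)
```

Keeping the test portion out of the agent's view, while the verification script replays it, is what makes a task both solvable and honestly scored.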