We’re Hiring: SwarmBench Task Engineer — SWE / Code (Contract)
Location: Remote (India, Bangladesh, Brazil, Colombia, Egypt, Ghana, Indonesia, Kenya, Nigeria, Turkey, Vietnam)
Duration: 4 Weeks (Short-term Contract)
Engagement: Contractor
Availability: 8 hrs/day (with a 4-hour overlap with PST)
Start: Immediate
Role Overview: We’re looking for experienced engineers to build high-quality, real-world coding benchmark tasks used to evaluate AI agents. You’ll work on complex open-source codebases, design multi-agent workflows, and create structured evaluation tasks.
What You’ll Do:
- Build benchmark tasks from real-world code changes (bug fixes, refactors, migrations)
- Work with Docker environments & Harbor framework
- Write precise task instructions and verification scripts
- Break down complex engineering problems into multi-agent workflows
- Debug, test, and refine tasks for accuracy and reproducibility
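To give a flavor of the verification work above: a task's verification script can be as small as a wrapper that runs the task's target tests and reports pass/fail. The sketch below is illustrative only; the path and pytest target in the usage comment are placeholders, and SwarmBench's and Harbor's actual interfaces are not specified here.

```python
import subprocess
import sys

def run_verification(cmd, cwd="."):
    """Run a task's verification command in the task workspace.

    The task passes iff the command exits 0 -- for example, the subset
    of the project's test suite that the reference fix makes pass.
    """
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return result.returncode == 0

# Hypothetical usage -- "/app" and the pytest target are placeholders,
# not actual SwarmBench/Harbor conventions:
# passed = run_verification(["pytest", "tests/test_fix.py", "-q"], cwd="/app")
```

In practice the script also needs to be reproducible inside the task's Docker image, which is why the Docker and debugging bullets above go hand in hand with this one.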
Requirements:
- 5+ years in Python & JavaScript
- Experience with AI coding benchmarks (SWE-bench, Terminal-Bench)
- Strong exposure to large open-source codebases (Django, Flask, FastAPI, Node.js)
- Solid Git workflow knowledge (PRs, diffs, commits)
- Hands-on Docker experience
- Strong problem-solving & technical documentation skills