Senior SDET / AI LLM || Remote || W2 Contract

Posted Today by Integrass

Summary: This Senior Software Development Engineer in Test (SDET) role centers on test automation, backend systems testing, and AI/LLM validation. It is a hands-on position: testing LLM-powered applications across the enterprise, building LLM-driven evaluation workflows, and defining quality standards for generative AI systems. The engineer works closely with engineering teams to improve testability and reliability while advocating best practices in AI quality engineering. The role is fully remote and requires strong Python skills and hands-on experience with ML or LLM systems.

Salary (Rate): £34.50 hourly

City: Undetermined

Country: Undetermined

Working Arrangements: Remote

IR35 Status: Undetermined

Seniority Level: Undetermined

Industry: IT

Detailed Description From Employer:

We are seeking a Senior Software Development Engineer in Test (SDET) with a strong background in test automation, backend systems testing, and AI/LLM validation.

This is a hands-on, highly influential role responsible for:

  • Testing LLM-powered applications used across the enterprise

  • Building LLM-driven testing and evaluation workflows

  • Defining organization-wide standards for GenAI quality, reliability, and release readiness


Key Responsibilities

LLM Testing & Evaluation

  • Design and implement test strategies for LLM-powered systems, including:

    • Prompt and response validation

    • Regression testing across model, prompt, and data changes

    • Evaluation of accuracy, consistency, hallucinations, bias, and safety

  • Build and maintain LLM-based evaluation frameworks using tools such as DeepEval, MLflow, LangChain, and Langflow

  • Develop synthetic and real-world test datasets in collaboration with the Data Engineer

  • Define quality thresholds, scoring mechanisms, benchmarks, and pass/fail criteria for GenAI systems (a minimal sketch follows this list)
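
To make the thresholds and pass/fail idea concrete, here is a minimal sketch using DeepEval's test-case API. The application under test (my_llm_app), the prompt, the context, and the thresholds are all hypothetical, and DeepEval's judge-based metrics call out to an LLM provider at run time (an API key is required).

```python
# Minimal DeepEval sketch: score one prompt/response pair and fail the
# test if either metric misses its threshold. The app under test,
# prompt, context, and thresholds are all illustrative.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric

def my_llm_app(prompt: str) -> str:
    # Stand-in for the real LLM-powered application under test.
    return "Go to Settings > Security and choose 'Reset password'."

def test_password_reset_answer():
    prompt = "How do I reset my password?"
    test_case = LLMTestCase(
        input=prompt,
        actual_output=my_llm_app(prompt),
        context=["Passwords are reset from Settings > Security."],
    )
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        HallucinationMetric(threshold=0.5),
    ])
```

In practice, evaluations like this run over batches of test cases with aggregate scoring, since individual judge calls are themselves non-deterministic.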


Test Automation & Framework Development

  • Build and maintain automated test frameworks for:

    • LLM APIs and services

    • Agentic workflows and RAG pipelines

    • Data ingestion and inference pipelines

  • Integrate LLM testing and evaluation into CI/CD pipelines, enforcing quality gates prior to production release (see the sketch after this list)

  • Partner with engineering teams to improve testability, reliability, and observability of AI systems

  • Perform root-cause analysis for failures related to model behavior, data quality, or orchestration logic
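
To show how such a gate can work in practice, here is a pytest sketch of a contract check against a hypothetical internal endpoint; the URL, payload shape, and latency budget are assumptions, not a real service contract. Because it asserts structure and budgets rather than exact wording, the check stays stable even though generated text is non-deterministic, and a failure in CI blocks the release.

```python
# Hypothetical CI quality gate for an LLM service endpoint: schema and
# latency checks that tolerate non-deterministic output text. A failure
# under pytest fails the pipeline and blocks the release.
import time

import pytest
import requests

LLM_API = "https://llm.internal.example/v1/generate"  # placeholder endpoint

@pytest.mark.parametrize("prompt", [
    "Summarize this ticket in one sentence.",
    "Classify the sentiment of this review.",
])
def test_generate_contract(prompt):
    start = time.monotonic()
    resp = requests.post(LLM_API, json={"prompt": prompt}, timeout=30)
    elapsed = time.monotonic() - start

    assert resp.status_code == 200
    body = resp.json()
    # Assert structure, not exact wording: output text varies run to run.
    assert isinstance(body.get("text"), str) and body["text"].strip()
    assert elapsed < 5.0, f"latency budget exceeded: {elapsed:.2f}s"
```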


Observability & Monitoring

  • Instrument LLM applications using Datadog LLM Observability (sketched after this list) to track:

    • Latency, token usage, errors, and cost

    • Quality regressions, drift, and performance anomalies

  • Build dashboards and alerting focused on LLM quality and reliability

  • Use production telemetry to continuously refine test coverage and evaluation strategies
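
As a rough illustration of that instrumentation, a sketch using Datadog's ddtrace LLM Observability SDK; the app name, model, token counts, and call_model stub are placeholders, and the exact SDK surface should be confirmed against current Datadog documentation.

```python
# Sketch: trace an LLM call with Datadog LLM Observability (ddtrace).
# DD_API_KEY / DD_SITE are read from the environment; names are placeholders.
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import llm

LLMObs.enable(ml_app="genai-quality")

def call_model(prompt: str) -> str:
    # Stand-in for the real model client.
    return "One-sentence summary of the input."

@llm(model_name="gpt-4o", model_provider="openai", name="summarize")
def summarize(prompt: str) -> str:
    output = call_model(prompt)
    # Attach I/O and token counts so latency, errors, usage, and cost
    # roll up in LLM Observability dashboards and monitors.
    LLMObs.annotate(
        input_data=prompt,
        output_data=output,
        metrics={"input_tokens": 120, "output_tokens": 48},  # illustrative counts
    )
    return output
```

Dashboards and monitors built on these spans then feed the telemetry-driven refinement of test coverage described above.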


Shared Services & Collaboration

  • Act as a consultative partner to product, platform, and data teams adopting LLM technologies

  • Provide guidance on:

    • Generative AI test strategies

    • Prompt engineering and workflow validation

    • Release readiness and AI risk assessment

  • Contribute to organization-wide standards and best practices for testing, explainability, and monitoring of AI systems

  • Participate in architecture and design reviews from a quality-first perspective


Engineering Excellence

  • Advocate for automation-first testing, infrastructure as code, and continuous monitoring

  • Drive adoption of Agile, DevOps, and CI/CD best practices within AI quality engineering

  • Conduct code reviews and promote secure, maintainable, and scalable test frameworks

  • Continuously improve internal tooling and frameworks within the QA Center of Excellence


Required Skills & Experience

  • Strong Python development skills

  • Experience testing backend systems, APIs, microservices, or distributed platforms

  • Proven experience building and maintaining automation frameworks

  • Ability to work effectively with ambiguous, non-deterministic systems


AI / LLM Experience

  • Hands-on experience testing or validating ML- or LLM-based systems

  • Familiarity with LLM orchestration and evaluation tools, including:

    • LangChain, Langflow

    • DeepEval, MLflow

  • Strong understanding of challenges unique to testing generative AI systems


Nice to Have

  • Experience with Datadog, especially LLM Observability

  • Exposure to Hugging Face, PyTorch, or TensorFlow (usage-level)

  • Experience testing RAG pipelines, Vector Databases, or data-driven platforms

  • Background working in platform teams, shared services, or QA Centers of Excellence

  • Experience collaborating closely with Data Engineering or ML Platform teams