Rate: Negotiable
IR35 Status: Outside
Working Arrangement: Remote
Country: USA
Summary: This Apache Druid role centers on ensuring the high availability and reliability of production systems while managing and optimizing Apache Druid clusters. The position requires implementing automation and Infrastructure as Code (IaC) practices, as well as designing and maintaining data orchestration workflows with Apache Airflow. Collaborating with development, data, and operations teams and communicating clearly about system status and incidents are also key aspects of the role.
Key Responsibilities:
- Ensure high availability and reliability of production systems.
- Implement and maintain robust monitoring and alerting systems.
- Participate in on-call rotations to respond to incidents and outages.
- Conduct post-incident reviews and implement preventative measures.
Automation and Infrastructure as Code (IaC):
- Automate infrastructure provisioning, configuration, and deployment using IaC tools (e.g., Terraform, Ansible).
- Develop and maintain CI/CD pipelines to streamline software releases.
- Optimize and automate data pipelines and workflows.
Apache Druid Management:
- Manage and optimize Apache Druid clusters for high performance and scalability.
- Troubleshoot Druid performance issues and implement solutions.
- Design and implement Druid data ingestion and query optimization strategies.
Apache Airflow Orchestration:
- Design, develop, and maintain Airflow DAGs for data orchestration and workflow automation.
- Monitor Airflow performance and troubleshoot issues.
- Optimize Airflow workflows for efficiency and reliability.
Monitoring and Logging:
- Implement and maintain comprehensive monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack).
- Analyze metrics and logs to identify performance bottlenecks and potential issues.
- Create and maintain dashboards for visualizing system health and performance.
Collaboration and Communication:
- Collaborate with development, data, and operations teams to ensure smooth operations.
- Communicate effectively with stakeholders regarding system status and incidents.
- Document processes and procedures.
Key Skills:
- Experience with Apache Druid and its management.
- Proficiency in Infrastructure as Code (IaC) tools such as Terraform and Ansible.
- Knowledge of CI/CD pipeline development and maintenance.
- Experience with Apache Airflow for data orchestration.
- Strong troubleshooting skills for performance issues.
- Familiarity with monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack).
- Ability to analyze metrics and logs for performance optimization.
- Strong collaboration and communication skills.
- Experience in documenting processes and procedures.
Salary (Rate): Negotiable
City: undetermined
Country: USA
Working Arrangements: remote
IR35 Status: outside IR35
Seniority Level: undetermined
Industry: IT
Position: Apache Druid
Location: Remote
Duration: Contract C2C
Thanks,
Nitesh