Negotiable
Undetermined
Undetermined
EMEA
Summary: The Python Developer role at Andela focuses on building scalable data ingestion pipelines and extracting structured content from complex documents, particularly PDFs and scanned materials. This backend-oriented position requires significant experience in document processing and aims to support GenAI applications by ensuring high-quality data extraction. The role emphasizes collaboration within a cross-functional team and integration with AWS infrastructure. Candidates should possess strong Python skills and familiarity with OCR tools and document processing libraries.
Salary (Rate): undetermined
City: undetermined
Country: undetermined
Working Arrangements: undetermined
IR35 Status: undetermined
Seniority Level: undetermined
Industry: IT
About Andela
Andela exists to connect brilliance and opportunity. Since 2014, we have been dedicated to breaking down global barriers and accelerating the future of work for both technologists and organizations around the world. For technologists, Andela offers competitive long-term career opportunities with leading organizations, access to a global community of professionals, and education opportunities with leading technology providers. For companies, Andela provides access to a global network of fully integrated team members that unlock their business innovation and growth potential. At Andela, we are deeply passionate about creating long-lasting and transformative growth opportunities for all and doing it in an E.P.I.C. way. We are excited to continue building our remote-first team with incredible people like you!
About the role
The role focuses on building scalable data ingestion pipelines and extracting structured content from complex, often unstructured documents, especially PDF reports, scanned documents, and technical drawings. You will play a key part in enabling the GenAI application to access and reason over new data sources. This is a backend-focused role, with responsibilities centered on content extraction and processing. While exposure to GenAI technologies is beneficial, the primary requirement is deep hands-on experience with PDF/document processing.
Responsibilities
- Design and implement robust data extraction pipelines to process diverse document types, especially PDFs with both text and scanned content.
- Customize extraction logic per data source, including metadata extraction (e.g., machine IDs, customer information).
- Work with document processing tools like Tesseract, Unstructured IO, or similar.
- Integrate with AWS-based infrastructure, including Lambda and ECS for deployment.
- Collaborate with a cross-functional team to onboard and validate new data sources.
- Ensure high accuracy and quality of extracted data to support downstream GenAI use.
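As a rough illustration of the first two responsibilities, the sketch below routes each page to its embedded text layer or to OCR, then pulls a machine ID using a per-source pattern. Everything named here is hypothetical and not taken from Andela's codebase: the `Page` type, the `min_chars` threshold, the source names, and the ID formats in `MACHINE_ID_PATTERNS` are assumptions for illustration. The OCR step is a stub standing in for a tool such as Tesseract.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    """Hypothetical page record: text layer (may be empty) plus raw image bytes."""
    number: int
    text_layer: str
    image: bytes = b""


# Hypothetical per-source patterns for a machine ID; real formats will differ.
MACHINE_ID_PATTERNS = {
    "service_reports": re.compile(r"Machine ID:\s*([A-Z]{2}-\d{4})"),
    "inspection_logs": re.compile(r"Unit\s+#(\d{6})"),
}


def needs_ocr(page: Page, min_chars: int = 20) -> bool:
    """Treat a page as scanned when its text layer is (nearly) empty."""
    return len(page.text_layer.strip()) < min_chars


def extract_text(page: Page, ocr: Callable[[bytes], str]) -> str:
    """Use the embedded text layer when present, otherwise fall back to OCR."""
    return ocr(page.image) if needs_ocr(page) else page.text_layer


def extract_machine_id(text: str, source: str) -> Optional[str]:
    """Apply the pattern registered for this source; None if absent or unknown."""
    pattern = MACHINE_ID_PATTERNS.get(source)
    match = pattern.search(text) if pattern else None
    return match.group(1) if match else None


def fake_ocr(image: bytes) -> str:
    """Stand-in for a real OCR call; it just decodes the bytes."""
    return image.decode("utf-8")
```

Keeping the patterns in a registry keyed by source is one way to customize extraction per data source without touching the routing logic; a production pipeline would likely need richer rules than a single regular expression.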
Qualifications
- 5–10 years of professional experience with Python, especially in backend or data engineering roles.
- Strong hands-on experience with document content extraction, particularly from PDFs with complex formats (e.g., scanned images, drawings).
- Familiarity with OCR tools (e.g., Tesseract) and content extraction libraries (e.g., Unstructured IO, pdfminer).
- Proficiency in building modular, production-grade Python code with data models and validation (e.g., Pydantic).
- Working knowledge of AWS services, especially Lambda, ECS, and containerization with Docker.
- Ability to quickly understand new data structures and design custom ingestion strategies.
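To make the "data models and validation" qualification concrete, here is a minimal sketch of a validated record type. It uses only the standard library so it runs without dependencies; in practice a library such as Pydantic would replace the hand-written checks. The field names, the `ExtractedRecord` type, and the confidence range are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedRecord:
    """Hypothetical shape of one extracted document record."""
    source: str
    machine_id: str
    customer: str
    ocr_confidence: float  # assumed to be in the range 0.0 to 1.0

    def __post_init__(self) -> None:
        # Reject records that would silently pollute downstream data.
        if not self.machine_id.strip():
            raise ValueError("machine_id must not be empty")
        if not 0.0 <= self.ocr_confidence <= 1.0:
            raise ValueError("ocr_confidence must be between 0.0 and 1.0")


def validate_records(rows: list[dict]) -> tuple[list[ExtractedRecord], list[str]]:
    """Split raw rows into valid records and human-readable error messages."""
    valid, errors = [], []
    for index, row in enumerate(rows):
        try:
            valid.append(ExtractedRecord(**row))
        except (TypeError, ValueError, AttributeError) as exc:
            # TypeError covers missing or unexpected fields in the raw row.
            errors.append(f"row {index}: {exc}")
    return valid, errors
```

Collecting errors instead of failing on the first bad row is one possible design choice: it lets a new data source be reviewed in a single pass during onboarding.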
Preferred Qualifications
- Prior experience working on GenAI or LLM-powered applications, especially in document understanding or search contexts.
- Experience with AWS Textract or Azure Document Intelligence for cloud-based content extraction.
- Familiarity with chunking strategies and data preparation for vector databases (e.g., for retrieval-augmented generation).
- Experience in fast-paced, deadline-driven projects and ability to deliver with minimal supervision.
- Comfortable working in globally distributed teams, with flexibility to align with European time zones.
- Overlap hours: 5–8 hours with Central European Time (CET, UTC+1; CEST, UTC+2 in summer).
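For readers unfamiliar with the chunking strategies mentioned in the preferred qualifications, the sketch below shows one common baseline: fixed-size word windows that overlap, so that content cut at a chunk boundary still appears whole in a neighboring chunk. The function name and default sizes are assumptions for illustration; real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then typically be embedded and stored in a vector database alongside metadata, such as the source document and page number, for retrieval-augmented generation.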
At Andela, we outcompete through diversity. We know that our strengths lie in the multiplicity of talents, perspectives, backgrounds, and orientations of residents in our community and we take pride in that. Andela is committed to a work environment in which all individuals are treated with respect and dignity. Each individual has the right to work in a professional atmosphere that promotes equal employment opportunities and prohibits discriminatory practices. Andela provides equal employment opportunities and an equal workplace to all employees and applicants without regard to factors including but not limited to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, pregnancy (including breastfeeding), genetic information, HIV/AIDS or any other medical status, family or parental status, marital status, amnesty or status as a covered veteran in accordance with applicable federal, state and local laws. This commitment applies to all terms and conditions of employment, including but not limited to hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training. Our policies expressly prohibit any form of harassment and/or discrimination as stated above. Andela is home for all. Come as you are.