Jul 2023 — Jul 2028
Feb 2025 — Dec 2025
• Deployment, versioning, and maintenance of a Python package for structured information extraction from PDF documents, based on open-source Machine Learning projects
• Design and implementation of a data extraction pipeline, integrating multiple Machine Learning techniques to generate structured tables from historical documents
• Development of an intelligent agent system for the institutional knowledge repository, using state-of-the-art frameworks and RAG (Retrieval-Augmented Generation) architectures
Oct 2023 — Feb 2025
• Leadership and coordination of development teams using Scrum, increasing operational efficiency and delivery predictability
• Development and maintenance of web applications with Next.js, NestJS, and PostgreSQL, with a stronger focus on front-end architecture and implementation
• Automation of internal processes through a member management system, reducing the execution time of organizational processes by up to half
• Development of a complete web portal for managing the internal member processes of a junior enterprise
• Implementation of data collection and analysis to support internal policy planning
• Design of a scalable architecture using modern front-end and back-end frameworks
• Integration with AWS services, including Amazon RDS and S3 for data persistence and storage
• Development of a RAG system for search and question answering in Ipea’s public knowledge repository, with support for tables, charts, and images
• Document ingestion pipeline using Docling, local computer vision models, and embedding generation
• Implementation of a multi-agent system using the Agno framework and the ChatGPT API to handle different query profiles
• Deployment with Docker and Kubernetes; Streamlit interface and FastAPI backend
• Development of a hybrid pipeline (CV + OCR + LLM) for structured extraction of student data from scanned historical official documents
• Fine-tuning and validation of a YOLO model for robust multi-column layout detection, raising name recall to as much as 75%
• Integration of OCR with typographic metadata (font size, style) and reconstruction of the educational hierarchy in noisy documents
• Application of LLMs for reliable semantic parsing of semi-structured text, replacing fragile regex-based approaches
• Empirical evaluation of multiple approaches (OCR-only, k-means, CV-only), including cost, error, and scalability analysis
• Generation of a longitudinal dataset for educational studies (1982–2001), enabling previously infeasible analyses