- 🔭 Currently, my main interest is developing Data related projects, such as ETL (Extract, Transform and Load) Pipelines on the cloud, WebScrapping with Python and Data Warehouses for Analytics.
- 🌱 I am learning how to use various technologies, a few examples are: Python, Pandas, Airflow, AWS, SQL, Postgres, Selenium, Langchain
- 📫 You can reach me at the email cauepaivalira@outlook.com.
Data Warehouse and automatic ETL pipeline for extracting and analyzing public brazilian goverment data
This project aims to develop a Data Warehouse (DW) that consolidates multiple public government data points over several years, focusing on socio-economic indicators. The DW will support analytical queries and time-series analysis, providing decision-makers with deeper insights into areas such as Economic Activity, Environmental Policies and Damage, and Public Health. Additionally, the project features an ETL pipeline to automate the collection, transformation, and loading of data from public sources into the DW.
The project builds upon the educational capabilities of Large Language Models (ex: GPT-3.5 and GPT-4) for education ,while also mitigating weaknesses such as hallucination and lack of knowledge about certain subjects and tests within the brazilian university admittance standardized test (ENEM).
To achieve these results an LLM application, using openAI models (gpt-3.5 turbo or gpt-4), along with aditional modules, such as internet search and retrieval augmented generarion for extra functionality, was developed.
According to feedback, over 60% of users said our solution has better and more accurate answers than chatGPT
Helpful Prompts and data extracted from official sources about the ENEM test was used for better results.
For the purpose of RAG over ENEM test questions a GPT action and its associated API was used, the API is hosted on AWS API gateway and uses a Lambda Function for taking user inputs, embedding them with openAI embeddings and then querying Qdrant vectorDB for the N questions more similar to user input, with N being the number of questions the user asked.
For the educational chatbots, both the website and the customGPT version, i needed a large dataset of ENEM questions and their correct answers for the purpose of RAG and reduce LLM hallucinations (such as giving the wrong answer to a question) but no such large scale data was available online.
In such context i created this project, which combines PDF/data mining through libraries like PyMuPDF2 to transform the ENEM pdf into either textual data or into JSON files (Extraction and Transform part) and then a Qdrant VectorDB loader to load the data into the vectorstore (Load part). That combination is able to process either single tests PDFs (and their associated answer PDFs) or entire folders with multiple tests, loading hundreds of questions at once, all while providing metadata and stats about the extraction process (number of extracted questions per year and subject) to a CSV file, through a Pandas DataFrame.
This project aims to collect and update data on cryptocurrencies like Bitcoin and Ethereum, storing the information in CSV files. These files cover extensive periods of trading data collected from the Binance US API.
The main technologies used are AWS Cloud (Lambda, API gateway, EC2 and S3), Apache Airflow for Data pipeline orchestration, Python and Pandas for manipulating the data
Heres the architecture of the Project/Pipeline:
Projects i developed as part of the Universidade of São Paulo Cientific Initiation Symposium (SIICUSP 2023)
Project developed in group for an eletronics class in university
The goal of this effort was the integrate Machine Learning Models , such as Computer vision and text classification, with a robot powered by a microcontroller (ESP-32)
My main contribution was with software development for the ESP-32 embedded systems, using C++ and modules such as Wi-Fi HTTP request handlers.
Heres the certificate for the Symposium