Caue Paiva caue-paiva

Hello! My name is Cauê and I am a computer science student at USP.

🔭 Currently, my main interest is developing data-related projects, such as ETL (Extract, Transform and Load) pipelines in the cloud, web scraping with Python and data warehouses for analytics.
🌱 I am learning to use a range of technologies, including Python, Pandas, Airflow, AWS, SQL, Postgres, Selenium and LangChain.
📫 You can reach me by email at cauepaivalira@outlook.com.

My projects

Projects I develop as part of a São Paulo State Research Foundation (FAPESP) R&D grant program

Data Warehouse and automatic ETL pipeline for extracting and analyzing public Brazilian government data

This project aims to develop a Data Warehouse (DW) that consolidates multiple public government data points over several years, focusing on socio-economic indicators. The DW will support analytical queries and time-series analysis, providing decision-makers with deeper insights into areas such as Economic Activity, Environmental Policies and Damage, and Public Health. Additionally, the project features an ETL pipeline to automate the collection, transformation, and loading of data from public sources into the DW.

Modules of the Project

Automatic ETL pipeline for extracting, cleaning and processing public Brazilian government data using APIs and web scraping

Python and SQL scripts related to the Data Warehouse, its schema and the insertion and retrieval of Data
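As a rough illustration of the kind of analytical, time-series query the DW is meant to support, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for the actual database, and the table names (`dim_region`, `dim_indicator`, `fact_measurement`), the columns and all values are hypothetical placeholders, not the project's real schema or data.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the real warehouse.
conn = sqlite3.connect(":memory:")

# Hypothetical star schema: two dimension tables and one fact table.
conn.executescript("""
CREATE TABLE dim_region (
    region_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE dim_indicator (
    indicator_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    topic        TEXT NOT NULL
);
CREATE TABLE fact_measurement (
    region_id    INTEGER NOT NULL REFERENCES dim_region(region_id),
    indicator_id INTEGER NOT NULL REFERENCES dim_indicator(indicator_id),
    year         INTEGER NOT NULL,
    value        REAL NOT NULL,
    PRIMARY KEY (region_id, indicator_id, year)
);
""")

# Placeholder rows (made-up values, for illustration only).
conn.executemany(
    "INSERT INTO dim_region VALUES (?, ?)",
    [(1, "Region A"), (2, "Region B")],
)
conn.execute(
    "INSERT INTO dim_indicator VALUES (1, 'Indicator X', 'Public Health')"
)
conn.executemany(
    "INSERT INTO fact_measurement VALUES (?, ?, ?, ?)",
    [(1, 1, 2020, 10.0), (1, 1, 2021, 12.0),
     (2, 1, 2020, 20.0), (2, 1, 2021, 18.0)],
)


def yearly_series(connection, indicator_name):
    """Return (year, average value across regions) for one indicator."""
    cursor = connection.execute(
        """
        SELECT f.year, AVG(f.value)
        FROM fact_measurement AS f
        JOIN dim_indicator AS i ON i.indicator_id = f.indicator_id
        WHERE i.name = ?
        GROUP BY f.year
        ORDER BY f.year
        """,
        (indicator_name,),
    )
    return cursor.fetchall()


series = yearly_series(conn, "Indicator X")
```

Keeping the measurements in a single fact table keyed by region, indicator and year is one common way to make this kind of yearly aggregation a simple join and `GROUP BY`.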

Projects I develop as part of a Brazilian government R&D grant program (PIBIT CNPq)

Educational Chatbot for Brazilian high school students

The project builds upon the capabilities of Large Language Models (e.g. GPT-3.5 and GPT-4) for education, while mitigating weaknesses such as hallucination and lack of knowledge about certain subjects and tests within the Brazilian standardized university admission test (ENEM).

To achieve these results, an LLM application was developed using OpenAI models (GPT-3.5 Turbo or GPT-4) along with additional modules, such as internet search and retrieval-augmented generation (RAG), for extra functionality.

According to feedback, over 60% of users said our solution gives better and more accurate answers than ChatGPT.

CustomGPTs using APIs hosted on AWS

Implementation of the Educational Chatbot described above, but using the new OpenAI custom GPTs service.

Helpful prompts and data extracted from official sources about the ENEM test were used for better results.

For RAG over ENEM test questions, a GPT action and its associated API were used. The API is hosted on AWS API Gateway and uses a Lambda function that takes the user input, embeds it with OpenAI embeddings and then queries the Qdrant vector database for the N questions most similar to that input, where N is the number of questions the user asked for.
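The "N most similar questions" step can be sketched as below. This is not the Qdrant or OpenAI API: the similarity search is done in memory with plain cosine similarity, and the function names and the 3-dimensional toy "embeddings" are hypothetical, chosen only to show the idea that the vector database implements at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_similar(query_vector, stored_questions, n):
    """Return the texts of the n stored questions closest to the query.

    stored_questions is a list of (question_text, embedding) pairs.
    In the real service this lookup is delegated to the vector database.
    """
    scored = [
        (cosine_similarity(query_vector, embedding), text)
        for text, embedding in stored_questions
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:n]]


# Toy embeddings (made-up values, for illustration only).
QUESTIONS = [
    ("Question about photosynthesis", [0.9, 0.1, 0.0]),
    ("Question about the French Revolution", [0.0, 0.2, 0.9]),
    ("Question about cellular respiration", [0.8, 0.3, 0.1]),
]

# The query vector stands in for the embedded user input; n is the
# number of questions the user asked for.
results = top_n_similar([1.0, 0.2, 0.0], QUESTIONS, n=2)
```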

ETL pipeline for processing PDFs and feeding data into vectorDBs

For the educational chatbots, both the website and the custom GPT version, I needed a large dataset of ENEM questions and their correct answers for RAG and to reduce LLM hallucinations (such as giving the wrong answer to a question), but no such large-scale data was available online.

In this context I created this project, which combines PDF/data mining through libraries like PyMuPDF2 to transform the ENEM PDFs into either text or JSON files (the Extract and Transform parts) with a Qdrant vector database loader that loads the data into the vector store (the Load part). The combination can process either single test PDFs (and their associated answer PDFs) or entire folders with multiple tests, loading hundreds of questions at once, while writing metadata and stats about the extraction process (number of extracted questions per year and subject) to a CSV file through a Pandas DataFrame.
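A minimal sketch of the Transform step and of the per-year, per-subject stats might look like the following. It works on text that has already been extracted from a PDF and uses only the standard library (`csv` and `Counter` instead of Pandas). The assumption that each question starts with a `QUESTÃO <number>` header, the function names and the sample text are all hypothetical, not taken from the project's code.

```python
import csv
import io
import re
from collections import Counter

# Assumed header pattern: each question starts with "QUESTÃO <number>".
QUESTION_HEADER = re.compile(r"QUESTÃO\s+(\d+)")


def split_questions(raw_text, year, subject):
    """Split extracted test text into one record per question."""
    # With one capture group, re.split returns
    # [preamble, number, body, number, body, ...]; the preamble is skipped.
    parts = QUESTION_HEADER.split(raw_text)
    records = []
    for number, body in zip(parts[1::2], parts[2::2]):
        records.append({
            "year": year,
            "subject": subject,
            "number": int(number),
            "text": body.strip(),
        })
    return records


def extraction_stats_csv(records):
    """Return CSV text with the number of extracted questions per year and subject."""
    counts = Counter((r["year"], r["subject"]) for r in records)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["year", "subject", "extracted_questions"])
    for (year, subject), total in sorted(counts.items()):
        writer.writerow([year, subject, total])
    return buffer.getvalue()


# Made-up sample standing in for text extracted from a test PDF.
SAMPLE_TEXT = """Cover page text
QUESTÃO 91
First question statement.
QUESTÃO 92
Second question statement.
"""

records = split_questions(SAMPLE_TEXT, year=2022, subject="natural sciences")
stats_csv = extraction_stats_csv(records)
```

Each record is a plain dictionary, so it can be serialized to JSON or passed to a vector database loader together with its year and subject as metadata.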

Projects I developed to learn new technologies and concepts!

Crypto Data ETL pipeline with Airflow and AWS

This project aims to collect and update data on cryptocurrencies like Bitcoin and Ethereum, storing the information in CSV files. These files cover extensive periods of trading data collected from the Binance US API.

The main technologies used are AWS Cloud (Lambda, API Gateway, EC2 and S3), Apache Airflow for data pipeline orchestration, and Python and Pandas for manipulating the data.
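The "collect and update" step can be sketched as follows, using only the standard library rather than Pandas. The sketch assumes candlestick rows in the list layout returned by Binance-style kline endpoints (open time in milliseconds, then open, high, low, close and volume as strings). The function names, column names and sample values are hypothetical, and no network call is made.

```python
import csv
import io
from datetime import datetime, timezone

CSV_COLUMNS = ["open_time", "open", "high", "low", "close", "volume"]


def klines_to_rows(klines):
    """Convert raw kline lists into dictionaries with typed values."""
    rows = []
    for kline in klines:
        # The open time is assumed to be a Unix timestamp in milliseconds.
        open_time = datetime.fromtimestamp(kline[0] / 1000, tz=timezone.utc)
        rows.append({
            "open_time": open_time.strftime("%Y-%m-%d %H:%M:%S"),
            "open": float(kline[1]),
            "high": float(kline[2]),
            "low": float(kline[3]),
            "close": float(kline[4]),
            "volume": float(kline[5]),
        })
    return rows


def merge_rows(existing, new):
    """Merge new rows into existing ones, replacing duplicates by open_time."""
    merged = {row["open_time"]: row for row in existing}
    merged.update({row["open_time"]: row for row in new})
    # The timestamp format sorts chronologically as plain text.
    return [merged[key] for key in sorted(merged)]


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up daily candles (illustrative values only).
stored = klines_to_rows([
    [1672531200000, "100.0", "110.0", "95.0", "105.0", "12.5"],
])
fetched = klines_to_rows([
    [1672531200000, "100.0", "111.0", "95.0", "106.0", "13.0"],
    [1672617600000, "106.0", "120.0", "101.0", "118.0", "20.0"],
])

updated = merge_rows(stored, fetched)
csv_text = rows_to_csv(updated)
```

Keying the merge on `open_time` means a scheduled run can fetch an overlapping window without writing duplicate rows to the CSV file.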

Here's the architecture of the project/pipeline:

Projects I developed as part of the University of São Paulo Scientific Initiation Symposium (SIICUSP 2023)

Robot with Computer Vision and Speech Recognition

Project developed in a group for an electronics class at university.

The goal of this effort was to integrate Machine Learning models, such as computer vision and text classification, with a robot powered by a microcontroller (ESP-32).

My main contribution was software development for the ESP-32 embedded system, using C++ and modules such as Wi-Fi HTTP request handlers.

Here's the certificate for the Symposium

Technologies I am familiar with:
