End-to-End Engineering & Analytics
Using cloud storage solutions, coding, pipelines, and visualization software to analyze rideshare data
Project Overview
Use a Jupyter Notebook to prototype the fact and dimension table models
Create a bucket in Google Cloud Storage (GCS) to store the data
Create a virtual machine in Compute Engine to host the pipeline
SSH into the VM to install Mage and build the data loader, transformer, and exporter blocks for the full ETL process
Export the transformed data to Google BigQuery
Connect BigQuery to Looker Studio to visualize how data partners could use the data for analysis
Tools Used
Google Cloud Storage
Python
Compute Engine
Mage Data Pipeline Tool
Google BigQuery
Looker Studio
GitHub Link
Want More Details? ↓
Data Table Exploration & Modeling
Step 1
I started this project by loading the target data (CSV) into a Jupyter notebook. Working there first let me confirm the code that builds my fact and dimension tables before recreating it in the pipeline, so development consumed fewer cloud resources ahead of deployment.
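For illustration, a minimal sketch of that notebook work might look like the following; the file name and columns (pickup_datetime, passenger_count, fare_amount, trip_distance) are assumptions based on a typical rideshare trip dataset, not the project's exact schema.

```python
import pandas as pd

# Load the raw trip data (file and column names are placeholders for illustration).
df = pd.read_csv("rideshare_trips.csv")
df = df.drop_duplicates().reset_index(drop=True)
df["trip_id"] = df.index  # surrogate key shared by the fact table and dimensions

# Example dimension: split the pickup timestamp into reusable attributes.
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
datetime_dim = df[["trip_id", "pickup_datetime"]].copy()
datetime_dim["pickup_hour"] = datetime_dim["pickup_datetime"].dt.hour
datetime_dim["pickup_weekday"] = datetime_dim["pickup_datetime"].dt.weekday
datetime_dim = datetime_dim.rename(columns={"trip_id": "datetime_id"})

# Example dimension: passenger counts per trip.
passenger_dim = df[["trip_id", "passenger_count"]].copy()
passenger_dim = passenger_dim.rename(columns={"trip_id": "passenger_count_id"})

# Fact table: foreign keys into each dimension plus the measures to analyze.
fact_table = df[["trip_id", "fare_amount", "trip_distance"]].copy()
fact_table["datetime_id"] = fact_table["trip_id"]
fact_table["passenger_count_id"] = fact_table["trip_id"]
```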
Into the Cloud & Creating VMs
Step 2
After conceptualizing how the tables would work, I loaded the data into a Google Cloud Storage bucket. Like other cloud solutions, GCS provides highly scalable, durable object storage, making it an ideal choice for this data: it offers built-in redundancy and easy backups, reducing the risk of data loss.
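The upload can be done through the console or with the Python client; a small sketch using the google-cloud-storage library, with placeholder bucket and object names, might look like this:

```python
from google.cloud import storage

# Authenticates via Application Default Credentials
# (e.g. after running `gcloud auth application-default login`).
client = storage.Client()

# Bucket and object names below are placeholders for illustration.
bucket = client.bucket("rideshare-data-bucket")
blob = bucket.blob("raw/rideshare_trips.csv")
blob.upload_from_filename("rideshare_trips.csv")

print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```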
Step 3
Next, I created a VM instance in Google Compute Engine. A virtual machine lets me control the level of isolation between workloads, allocate resources specifically for each job, and run different operating systems on the same physical hardware. It also leaves room to grow, since the instance can be scaled up as processing needs increase.
Creating ETL Processes in Mage
Step 4
I then used an SSH connection to the virtual machine to install Python, Pandas, Mage, and the Google Cloud client libraries. With the environment ready, I used Mage to build the data pipeline. This tool is similar to Airflow but reduces the amount of extra code needed to wire the pipeline together.
Here is the shape of the code that creates the different steps of the pipeline: loader, transformer, and exporter.
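The full blocks live in the linked repo; a condensed sketch based on Mage's standard decorator templates is shown below. The bucket URL, dataset and table names, and the transformation logic are placeholders, and in a real Mage project each block sits in its own file.

```python
import pandas as pd
from mage_ai.data_preparation.decorators import data_loader, transformer, data_exporter
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader


@data_loader
def load_data(*args, **kwargs):
    # Placeholder URL: the CSV sitting in the GCS bucket from Step 2.
    url = 'https://storage.googleapis.com/rideshare-data-bucket/raw/rideshare_trips.csv'
    return pd.read_csv(url)


@transformer
def transform(df: pd.DataFrame, *args, **kwargs):
    # The real block builds every fact and dimension table; one dimension shown here.
    df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
    datetime_dim = df[['pickup_datetime']].drop_duplicates().reset_index(drop=True)
    datetime_dim['datetime_id'] = datetime_dim.index
    return {'datetime_dim': datetime_dim.to_dict(orient='dict')}


@data_exporter
def export_data(data, **kwargs):
    # io_config.yaml (in the Mage project root) holds the service-account credentials.
    config = ConfigFileLoader('io_config.yaml', 'default')
    for table_name, records in data.items():
        BigQuery.with_config(config).export(
            pd.DataFrame(records),
            f'my-gcp-project.rideshare_dataset.{table_name}',  # placeholder project/dataset
            if_exists='replace',
        )
```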
Exporting Data for Analysis & Visualization
Step 5
With the data engineering portion of this project finished, I wanted to showcase how that data would surface for my team of data analysts, business partners, or data scientists. The nine tables I created are visible in BigQuery on the left. Next, let's run a query that could be exported to visualization software!
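As an example, a hypothetical aggregate like the one below could be run with the BigQuery Python client and then surfaced in Looker Studio; the project, dataset, table, and column names are placeholders rather than the project's actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder project, dataset, table, and column names for illustration.
sql = """
    SELECT
        d.pickup_hour,
        AVG(f.fare_amount) AS avg_fare,
        COUNT(*)           AS trips
    FROM `my-gcp-project.rideshare_dataset.fact_table` AS f
    JOIN `my-gcp-project.rideshare_dataset.datetime_dim` AS d
        ON f.datetime_id = d.datetime_id
    GROUP BY d.pickup_hour
    ORDER BY d.pickup_hour
"""

# Run the query and pull the results into a DataFrame for inspection or export.
results = client.query(sql).to_dataframe()
print(results.head())
```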
Step 6
After loading the data into Looker Studio, we can start to see our hard work come together! This data started as a source that was not easy to digest. Now, with some data engineering magic, we have scalable, durable, and reliable pipelines that will empower any data team to create insights for our clients and leadership.