Introduction and Context
- The video builds upon the previous lesson where Docker was introduced, focusing on how it can be used for data engineering tasks.
- The goal of this session is to set up a PostgreSQL database using Docker, ingest data into it, and practice SQL queries.
- The NY Taxi dataset will be used throughout the course to learn SQL and data ingestion techniques.
Key Concepts Covered
1. Running PostgreSQL in Docker
- Docker Image: Official PostgreSQL Docker image (version 13) is used.
- Configuration: Includes setting environment variables for the database:
  - User: `root`
  - Password: `root`
  - Database: `ny_taxi`

```bash
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v c:/workspace/de-zoomcamp/1_intro_to_data_engineering/docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5433:5432 \
  postgres:13
```
- Commands for Configuration:
  - Use `-e` flags to set environment variables.
  - Use the `-v` flag to map a folder on the host machine into the container for persistent storage: `path-to-host-folder:path-to-container-folder`.
  - Use the `-p` flag to map ports (PostgreSQL listens on 5432). If Postgres is already installed on your local machine, map a different host port for your Docker Postgres to avoid conflicts such as: `pgcli - connection failed: FATAL: password authentication failed for user "root"`.
2. Persisting Data with Docker
- Mounting Volumes:
  - A folder (e.g., `ny_taxi_postgres_data`) is mapped to store the database files.
  - This ensures that data remains available even if the Docker container is restarted.
- Windows-Specific Note: volume mappings on Windows require absolute paths with specific formatting (e.g., `C:\path\to\folder`).
3. Accessing PostgreSQL Database
- Using pgcli:
  - A Python-based command-line client for PostgreSQL.
  - Installed with `pip install pgcli`.
  - Connection command format: `pgcli -h localhost -p 5432 -u root -d ny_taxi`. If you also installed Postgres on your local machine, connect through the remapped host port instead, e.g. `pgcli -h localhost -p 5433 -u root -d ny_taxi`.
  - Allows running SQL queries directly from the terminal, e.g.:
    - `\dt` lists tables
    - `\db` lists tablespaces
    - `SELECT COUNT(1) FROM yellow_taxi_data;` counts the available data rows
4. Exploring the NY Taxi Dataset
- Dataset Overview:
  - Contains taxi trip records (yellow and green taxis) from NYC.
  - Fields include pickup/dropoff times, passenger count, distance, fare, tips, etc.
  - A data dictionary is available for detailed field descriptions.
- Downloading the Dataset:
  - Use `wget` or a manual download to get the CSV files:
    - `yellow_tripdata_2021-01.csv`
    - `taxi_zone_lookup.csv`
    - e.g. `wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv`
  - For performance, process only a subset of rows at first (e.g., the first 100 rows); see the sketch below.
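A minimal pandas sketch of that subset read, assuming the filename from the download step above (the output filename is just an example):

```python
import pandas as pd

# Read only the first 100 rows for a quick first look
df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=100)

# Optionally save the subset to iterate faster later
# (hypothetical output filename, pick your own)
df.to_csv('yellow_head.csv', index=False)
```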
5. Preparing the Dataset with Pandas
- Environment Setup:
  - Use Jupyter Notebook for interactive data exploration.
  - Libraries used: `pandas` and `sqlalchemy`.
- Initial Data Exploration:
  - Read the dataset using `pandas.read_csv()`.
  - Examine the first few rows and basic statistics, as sketched below.
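A sketch of this exploration step; the `pd.to_datetime` conversion is an assumption, but it is consistent with the `TIMESTAMP` columns in the generated schema shown below:

```python
import pandas as pd

df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=100)

print(df.head())      # first rows
print(df.dtypes)      # column types
print(df.describe())  # basic statistics

# The pickup/dropoff columns come in as plain text;
# converting them yields proper TIMESTAMP columns in the generated schema
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
```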
- Data Schema Definition:
  - Create an SQL table schema based on the dataset structure.
  - Example schema:

```sql
CREATE TABLE yellow_taxi_data (
  vendor_id INTEGER,
  pickup_datetime TIMESTAMP,
  dropoff_datetime TIMESTAMP,
  passenger_count INTEGER,
  trip_distance FLOAT,
  fare_amount DECIMAL,
  tip_amount DECIMAL,
  total_amount DECIMAL
);
```
- Data Definition Language:
  - `print(pd.io.sql.get_schema(df, name='yellow_taxi_data'))` outputs:

```sql
CREATE TABLE yellow_taxi_data (
  "VendorID" BIGINT,
  tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE,
  tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE,
  passenger_count BIGINT,
  trip_distance FLOAT(53),
  "RatecodeID" BIGINT,
  store_and_fwd_flag TEXT,
  "PULocationID" BIGINT,
  "DOLocationID" BIGINT,
  payment_type BIGINT,
  fare_amount FLOAT(53),
  extra FLOAT(53),
  mta_tax FLOAT(53),
  tip_amount FLOAT(53),
  tolls_amount FLOAT(53),
  improvement_surcharge FLOAT(53),
  total_amount FLOAT(53),
  congestion_surcharge FLOAT(53)
)
```
- Batch Ingestion:
  - The dataset is too large to insert in one go, so it is ingested in batches (see section 6).
  - Tip: Shift+Tab opens the help for a function in Jupyter Notebook.
6. Ingesting Data into PostgreSQL
- Chunk Processing:
  - To handle large datasets efficiently, read and insert data in chunks (e.g., 100,000 rows per chunk), as sketched below.
  - Use pandas' `to_sql` method to insert data into PostgreSQL.
- Connection Setup:
  - Use `sqlalchemy` to create a database engine, as sketched below.
  - Connection string format: `postgresql://<user>:<password>@<host>:<port>/<database>`.
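A minimal sketch of the engine setup, using the credentials from section 1 (adjust the host port if you remapped it as in the `docker run` example):

```python
from sqlalchemy import create_engine

# Credentials and database name match the docker run command above;
# use port 5433 on the host if you remapped 5433:5432 as in section 1
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
engine.connect()  # raises an error if the database is unreachable
```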
- Ingestion Workflow:
  - Generate the SQL schema.
  - Create the table in PostgreSQL.
  - Insert the data chunk by chunk.
  - Verify the data using SQL queries (e.g., `SELECT COUNT(*) FROM yellow_taxi_data;`); see the end-to-end sketch below.
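Putting the steps together, a hedged end-to-end sketch (the datetime conversion and `if_exists` choices are assumptions consistent with the schema output above, not necessarily the exact course code):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv',
                      iterator=True, chunksize=100_000)

first = True
for chunk in df_iter:
    # Parse the datetime columns, which read_csv leaves as text
    chunk['tpep_pickup_datetime'] = pd.to_datetime(chunk['tpep_pickup_datetime'])
    chunk['tpep_dropoff_datetime'] = pd.to_datetime(chunk['tpep_dropoff_datetime'])
    if first:
        # Create an empty table with the right columns, replacing any old one
        chunk.head(0).to_sql('yellow_taxi_data', con=engine, if_exists='replace')
        first = False
    chunk.to_sql('yellow_taxi_data', con=engine, if_exists='append')

# Verify the load
print(pd.read_sql('SELECT COUNT(*) FROM yellow_taxi_data;', con=engine))
```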
Practical Tips and Insights
- Setting Up Docker:
  - Use `docker run` commands with appropriate flags to configure and run containers.
  - Troubleshoot port conflicts using the `docker ps` and `docker stop` commands.
- Optimizing Data Handling:
  - For large datasets, avoid loading everything into memory at once; use iterators.
  - Save subsets of data for quicker initial exploration.
- Windows-Specific Adjustments:
  - Pay attention to path formats and ensure compatibility with Docker.
- SQL Best Practices:
  - Define appropriate data types (e.g., `DECIMAL` for monetary values).
  - Use indexes for faster querying on large datasets (covered in later lessons).
- Debugging and Verification:
  - Always test connections and data integrity after setup.
  - Use simple queries like `SELECT 1;` to verify database readiness, as in the sketch below.
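A minimal readiness check, assuming the connection settings from earlier (swap in port 5433 if you remapped it):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# A trivial query that succeeds only if host, port, and credentials are right
print(pd.read_sql('SELECT 1;', con=engine))
```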
Next Steps in the Course
- Learn about Docker Compose to manage multiple Docker containers.
- Use Airflow for orchestrating data pipelines.
- Perform advanced SQL queries and integrate with BigQuery for analytics.