Pizofreude

Posted on Feb 4

Study Notes 1.2.6: SQL Refresher

#dataengineering #dezoomcamp #sql #postgres

1. Introduction

This session covers SQL basics and is part of a series on Docker and SQL.
A taxi zone lookup file has been downloaded and loaded into a Postgres database as a table named zones.
The table contains various zones, including locations in Manhattan, Brooklyn, Queens, and Newark Airport.
The objective is to refresh SQL basics that will be useful later in the course, particularly for analytics engineering and dbt.

2. SQL Queries and Joins

Basic Query Execution

Standard way of counting rows:
This is similar to wc -l <dataset.csv> which returns total number of lines in the dataset.csv.
```
SELECT COUNT(*)
FROM table_name
```

The yellow taxi trips dataset is queried using:

SELECT
    *
FROM
    yellow_taxi_trips
LIMIT 100;

The dataset contains columns like:
- pickup_time
- dropoff_time
- pu_location_id (pickup location ID)
- do_location_id (drop-off location ID)
Objective: Replace location IDs with actual zone names from the zones table.

INNER JOIN (Matching Records Only)

The yellow_taxi_trips table is joined with the zones table to fetch zone names:

SELECT t.pickup_time, t.dropoff_time, t.total_amount,
       zp.borough AS pickup_borough, zp.zone AS pickup_zone,
       zd.borough AS dropoff_borough, zd.zone AS dropoff_zone
FROM yellow_taxi_trips t
JOIN zones zp ON t.pu_location_id = zp.location_id
JOIN zones zd ON t.do_location_id = zd.location_id;

Explanation:
- JOIN ensures only records where IDs match in both tables are included.
- zones table is aliased twice (zp for pickup zones, zd for drop-off zones).
- The result replaces location IDs with actual zone names.

In our case, here’s the implicit inner join:

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "dropoff_loc"   
FROM
    yellow_taxi_data t,
    taxi_zones zpu,
    taxi_zones zdo
WHERE
    t."PULocationID" = zpu."LocationID" AND
    t."DOLocationID" = zdo."LocationID"
LIMIT 100;

Alternative Approach Using WHERE Clause (Older Method)

An alternative way to achieve the same result using a WHERE clause:

SELECT t.*, zp.zone AS pickup_zone, zd.zone AS dropoff_zone
FROM yellow_taxi_trips t, zones zp, zones zd
WHERE t.pu_location_id = zp.location_id
AND t.do_location_id = zd.location_id;

Drawback: This approach is less readable and less commonly used than explicit JOIN syntax.

Here’s the explicit inner join in our case: Notice no WHERE clause is needed.

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "dropoff_loc"   
FROM
    yellow_taxi_data t JOIN taxi_zones zpu
        ON t."PULocationID" = zpu."LocationID"
    JOIN taxi_zones zdo
        ON t."DOLocationID" = zdo."LocationID"

LIMIT 100;

Checking for Missing Pickup/Dropoff Locations

To check if any taxi trips have NULL location IDs:

SELECT * FROM yellow_taxi_trips WHERE pu_location_id IS NULL;
SELECT * FROM yellow_taxi_trips WHERE do_location_id IS NULL;

In our case:

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_data t
WHERE
    "PULocationID" is NULL AND
    "DOLocationID" is NULL
LIMIT 100;

To verify if all location IDs exist in the zones table:

SELECT pu_location_id FROM yellow_taxi_trips
WHERE pu_location_id NOT IN (SELECT location_id FROM zones);

If no records are returned, all locations are valid.

In our case:

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_data t
WHERE
    "PULocationID" NOT IN (SELECT "LocationID" FROM taxi_zones) AND
    "DOLocationID" NOT IN (SELECT "LocationID" FROM taxi_zones)
LIMIT 100;

3. Outer Joins (Handling Missing Data)

LEFT JOIN (Keeping Unmatched Records from the Left Table)

If a location ID exists in yellow_taxi_trips but not in zones, a normal join would exclude that row.

A LEFT JOIN ensures these rows are included, with NULL values for missing zone data:

SELECT t.pickup_time, t.dropoff_time, t.total_amount,
       COALESCE(zp.zone, 'Unknown') AS pickup_zone,
       COALESCE(zd.zone, 'Unknown') AS dropoff_zone
FROM yellow_taxi_trips t
LEFT JOIN zones zp ON t.pu_location_id = zp.location_id
LEFT JOIN zones zd ON t.do_location_id = zd.location_id;

COALESCE replaces NULL values with 'Unknown'.

RIGHT JOIN and FULL OUTER JOIN

RIGHT JOIN: Keeps all records from the zones table, even if there are no matching trips.
FULL OUTER JOIN: Keeps all records from both tables, filling missing values with NULLs.

4. Aggregation Using GROUP BY

Counting Trips Per Day

To group taxi trips by day and count them:

SELECT DATE_TRUNC('day', dropoff_time) AS trip_day, COUNT(*) AS trip_count
FROM yellow_taxi_trips
GROUP BY trip_day
ORDER BY trip_day ASC;

Alternative approach using CAST:

SELECT CAST(dropoff_time AS DATE) AS trip_day, COUNT(*) AS trip_count
FROM yellow_taxi_trips
GROUP BY trip_day
ORDER BY trip_day ASC;

Finding the Busiest Day

Sorting by count to find the day with the most trips:

SELECT CAST(dropoff_time AS DATE) AS trip_day, COUNT(*) AS trip_count
FROM yellow_taxi_trips
GROUP BY trip_day
ORDER BY trip_count DESC
LIMIT 1;

Example result: January 28th was the busiest day.

Finding Maximum Revenue Per Day

To check highest fare per trip per day:

SELECT CAST(dropoff_time AS DATE) AS trip_day, MAX(total_amount) AS max_fare
FROM yellow_taxi_trips
GROUP BY trip_day
ORDER BY max_fare DESC;

Some drivers made over $1000 in a single trip.

Grouping by Multiple Columns

To group by both day and district:

SELECT CAST(dropoff_time AS DATE) AS trip_day, zd.borough, COUNT(*) AS trip_count
FROM yellow_taxi_trips t
JOIN zones zd ON t.do_location_id = zd.location_id
GROUP BY trip_day, zd.borough
ORDER BY trip_day ASC, zd.borough ASC;

Instead of listing each column, use positional reference:
```
GROUP BY 1, 2;
```

5. Summary and Key Takeaways

SQL Joins are crucial for combining datasets and replacing foreign keys with meaningful data.
INNER JOIN returns only matching records, while LEFT JOIN preserves unmatched records.
GROUP BY is essential for aggregations and analytics, allowing operations like counting trips per day and finding peak revenue days.
These SQL techniques will be essential in later topics, such as analytics engineering (dbt).
The session also touched on data ingestion from CSV into Postgres, which is relevant for future lessons.

DEV Community