DEV Community

Cover image for Study Notes 1.2.6: SQL Refresher
Pizofreude
Pizofreude

Posted on

Study Notes 1.2.6: SQL Refresher

1. Introduction

  • This session covers SQL basics and is part of a series on Docker and SQL.
  • A taxi zone lookup file has been downloaded and loaded into a Postgres database as a table named zones.
  • The table contains various zones, including locations in Manhattan, Brooklyn, Queens, and Newark Airport.
  • The objective is to refresh SQL basics that will be useful later in the course, particularly for analytics engineering and dbt.

2. SQL Queries and Joins

Basic Query Execution

  • Standard way of counting rows:
    This is similar to wc -l <dataset.csv> which returns total number of lines in the dataset.csv.

    SELECT COUNT(*)
    FROM table_name
    
  • The yellow taxi trips dataset is queried using:

    SELECT
        *
    FROM
        yellow_taxi_trips
    LIMIT 100;
    
    
  • The dataset contains columns like:

    • pickup_time
    • dropoff_time
    • pu_location_id (pickup location ID)
    • do_location_id (drop-off location ID)
  • Objective: Replace location IDs with actual zone names from the zones table.

INNER JOIN (Matching Records Only)

  • The yellow_taxi_trips table is joined with the zones table to fetch zone names:

    SELECT t.pickup_time, t.dropoff_time, t.total_amount,
           zp.borough AS pickup_borough, zp.zone AS pickup_zone,
           zd.borough AS dropoff_borough, zd.zone AS dropoff_zone
    FROM yellow_taxi_trips t
    JOIN zones zp ON t.pu_location_id = zp.location_id
    JOIN zones zd ON t.do_location_id = zd.location_id;
    
    
  • Explanation:

    • JOIN ensures only records where IDs match in both tables are included.
    • zones table is aliased twice (zp for pickup zones, zd for drop-off zones).
    • The result replaces location IDs with actual zone names.
  • In our case, here’s the implicit inner join:

    SELECT
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        total_amount,
        CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pickup_loc",
        CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "dropoff_loc"   
    FROM
        yellow_taxi_data t,
        taxi_zones zpu,
        taxi_zones zdo
    WHERE
        t."PULocationID" = zpu."LocationID" AND
        t."DOLocationID" = zdo."LocationID"
    LIMIT 100;
    

Alternative Approach Using WHERE Clause (Older Method)

  • An alternative way to achieve the same result using a WHERE clause:

    SELECT t.*, zp.zone AS pickup_zone, zd.zone AS dropoff_zone
    FROM yellow_taxi_trips t, zones zp, zones zd
    WHERE t.pu_location_id = zp.location_id
    AND t.do_location_id = zd.location_id;
    
    
  • Drawback: This approach is less readable and less commonly used than explicit JOIN syntax.

  • Here’s the explicit inner join in our case: Notice no WHERE clause is needed.

    SELECT
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        total_amount,
        CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pickup_loc",
        CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "dropoff_loc"   
    FROM
        yellow_taxi_data t JOIN taxi_zones zpu
            ON t."PULocationID" = zpu."LocationID"
        JOIN taxi_zones zdo
            ON t."DOLocationID" = zdo."LocationID"
    
    LIMIT 100;
    

Checking for Missing Pickup/Dropoff Locations

  • To check if any taxi trips have NULL location IDs:

    SELECT * FROM yellow_taxi_trips WHERE pu_location_id IS NULL;
    SELECT * FROM yellow_taxi_trips WHERE do_location_id IS NULL;
    
    

    In our case:

    SELECT
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        total_amount,
        "PULocationID",
        "DOLocationID"
    FROM
        yellow_taxi_data t
    WHERE
        "PULocationID" is NULL AND
        "DOLocationID" is NULL
    LIMIT 100;
    
  • To verify if all location IDs exist in the zones table:

    SELECT pu_location_id FROM yellow_taxi_trips
    WHERE pu_location_id NOT IN (SELECT location_id FROM zones);
    
    
    • If no records are returned, all locations are valid.

    In our case:

    SELECT
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        total_amount,
        "PULocationID",
        "DOLocationID"
    FROM
        yellow_taxi_data t
    WHERE
        "PULocationID" NOT IN (SELECT "LocationID" FROM taxi_zones) AND
        "DOLocationID" NOT IN (SELECT "LocationID" FROM taxi_zones)
    LIMIT 100;
    

3. Outer Joins (Handling Missing Data)

LEFT JOIN (Keeping Unmatched Records from the Left Table)

  • If a location ID exists in yellow_taxi_trips but not in zones, a normal join would exclude that row.
  • A LEFT JOIN ensures these rows are included, with NULL values for missing zone data:

    SELECT t.pickup_time, t.dropoff_time, t.total_amount,
           COALESCE(zp.zone, 'Unknown') AS pickup_zone,
           COALESCE(zd.zone, 'Unknown') AS dropoff_zone
    FROM yellow_taxi_trips t
    LEFT JOIN zones zp ON t.pu_location_id = zp.location_id
    LEFT JOIN zones zd ON t.do_location_id = zd.location_id;
    
    
  • COALESCE replaces NULL values with 'Unknown'.

RIGHT JOIN and FULL OUTER JOIN

  • RIGHT JOIN: Keeps all records from the zones table, even if there are no matching trips.
  • FULL OUTER JOIN: Keeps all records from both tables, filling missing values with NULLs.

4. Aggregation Using GROUP BY

Counting Trips Per Day

  • To group taxi trips by day and count them:

    SELECT DATE_TRUNC('day', dropoff_time) AS trip_day, COUNT(*) AS trip_count
    FROM yellow_taxi_trips
    GROUP BY trip_day
    ORDER BY trip_day ASC;
    
    
  • Alternative approach using CAST:

    SELECT CAST(dropoff_time AS DATE) AS trip_day, COUNT(*) AS trip_count
    FROM yellow_taxi_trips
    GROUP BY trip_day
    ORDER BY trip_day ASC;
    
    

Finding the Busiest Day

  • Sorting by count to find the day with the most trips:

    SELECT CAST(dropoff_time AS DATE) AS trip_day, COUNT(*) AS trip_count
    FROM yellow_taxi_trips
    GROUP BY trip_day
    ORDER BY trip_count DESC
    LIMIT 1;
    
    
  • Example result: January 28th was the busiest day.

Finding Maximum Revenue Per Day

  • To check highest fare per trip per day:

    SELECT CAST(dropoff_time AS DATE) AS trip_day, MAX(total_amount) AS max_fare
    FROM yellow_taxi_trips
    GROUP BY trip_day
    ORDER BY max_fare DESC;
    
    
  • Some drivers made over $1000 in a single trip.

Grouping by Multiple Columns

  • To group by both day and district:

    SELECT CAST(dropoff_time AS DATE) AS trip_day, zd.borough, COUNT(*) AS trip_count
    FROM yellow_taxi_trips t
    JOIN zones zd ON t.do_location_id = zd.location_id
    GROUP BY trip_day, zd.borough
    ORDER BY trip_day ASC, zd.borough ASC;
    
    
  • Instead of listing each column, use positional reference:

    GROUP BY 1, 2;
    
    

5. Summary and Key Takeaways

  • SQL Joins are crucial for combining datasets and replacing foreign keys with meaningful data.
  • INNER JOIN returns only matching records, while LEFT JOIN preserves unmatched records.
  • GROUP BY is essential for aggregations and analytics, allowing operations like counting trips per day and finding peak revenue days.
  • These SQL techniques will be essential in later topics, such as analytics engineering (dbt).
  • The session also touched on data ingestion from CSV into Postgres, which is relevant for future lessons.

Top comments (0)