1. Introduction
- This session covers SQL basics and is part of a series on Docker and SQL.
- A taxi zone lookup file has been downloaded and loaded into a Postgres database as a table named
zones
. - The table contains various zones, including locations in Manhattan, Brooklyn, Queens, and Newark Airport.
- The objective is to refresh SQL basics that will be useful later in the course, particularly for analytics engineering and dbt.
2. SQL Queries and Joins
Basic Query Execution
-
Standard way of counting rows:
This is similar towc -l <dataset.csv>
which returns total number of lines in thedataset.csv
.
SELECT COUNT(*) FROM table_name
-
The yellow taxi trips dataset is queried using:
SELECT * FROM yellow_taxi_trips LIMIT 100;
-
The dataset contains columns like:
pickup_time
dropoff_time
-
pu_location_id
(pickup location ID) -
do_location_id
(drop-off location ID)
Objective: Replace location IDs with actual zone names from the
zones
table.
INNER JOIN (Matching Records Only)
-
The
yellow_taxi_trips
table is joined with thezones
table to fetch zone names:
SELECT t.pickup_time, t.dropoff_time, t.total_amount, zp.borough AS pickup_borough, zp.zone AS pickup_zone, zd.borough AS dropoff_borough, zd.zone AS dropoff_zone FROM yellow_taxi_trips t JOIN zones zp ON t.pu_location_id = zp.location_id JOIN zones zd ON t.do_location_id = zd.location_id;
-
Explanation:
-
JOIN
ensures only records where IDs match in both tables are included. -
zones
table is aliased twice (zp
for pickup zones,zd
for drop-off zones). - The result replaces location IDs with actual zone names.
-
-
In our case, here’s the implicit inner join:
SELECT tpep_pickup_datetime, tpep_dropoff_datetime, total_amount, CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pickup_loc", CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "dropoff_loc" FROM yellow_taxi_data t, taxi_zones zpu, taxi_zones zdo WHERE t."PULocationID" = zpu."LocationID" AND t."DOLocationID" = zdo."LocationID" LIMIT 100;
Alternative Approach Using WHERE Clause (Older Method)
-
An alternative way to achieve the same result using a WHERE clause:
SELECT t.*, zp.zone AS pickup_zone, zd.zone AS dropoff_zone FROM yellow_taxi_trips t, zones zp, zones zd WHERE t.pu_location_id = zp.location_id AND t.do_location_id = zd.location_id;
Drawback: This approach is less readable and less commonly used than explicit
JOIN
syntax.-
Here’s the explicit inner join in our case: Notice no
WHERE
clause is needed.
SELECT tpep_pickup_datetime, tpep_dropoff_datetime, total_amount, CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pickup_loc", CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "dropoff_loc" FROM yellow_taxi_data t JOIN taxi_zones zpu ON t."PULocationID" = zpu."LocationID" JOIN taxi_zones zdo ON t."DOLocationID" = zdo."LocationID" LIMIT 100;
Checking for Missing Pickup/Dropoff Locations
-
To check if any taxi trips have NULL location IDs:
SELECT * FROM yellow_taxi_trips WHERE pu_location_id IS NULL; SELECT * FROM yellow_taxi_trips WHERE do_location_id IS NULL;
In our case:
SELECT tpep_pickup_datetime, tpep_dropoff_datetime, total_amount, "PULocationID", "DOLocationID" FROM yellow_taxi_data t WHERE "PULocationID" is NULL AND "DOLocationID" is NULL LIMIT 100;
-
To verify if all location IDs exist in the zones table:
SELECT pu_location_id FROM yellow_taxi_trips WHERE pu_location_id NOT IN (SELECT location_id FROM zones);
- If no records are returned, all locations are valid.
In our case:
SELECT tpep_pickup_datetime, tpep_dropoff_datetime, total_amount, "PULocationID", "DOLocationID" FROM yellow_taxi_data t WHERE "PULocationID" NOT IN (SELECT "LocationID" FROM taxi_zones) AND "DOLocationID" NOT IN (SELECT "LocationID" FROM taxi_zones) LIMIT 100;
3. Outer Joins (Handling Missing Data)
LEFT JOIN (Keeping Unmatched Records from the Left Table)
- If a location ID exists in
yellow_taxi_trips
but not inzones
, a normal join would exclude that row. -
A
LEFT JOIN
ensures these rows are included, withNULL
values for missing zone data:
SELECT t.pickup_time, t.dropoff_time, t.total_amount, COALESCE(zp.zone, 'Unknown') AS pickup_zone, COALESCE(zd.zone, 'Unknown') AS dropoff_zone FROM yellow_taxi_trips t LEFT JOIN zones zp ON t.pu_location_id = zp.location_id LEFT JOIN zones zd ON t.do_location_id = zd.location_id;
COALESCE
replaces NULL values with 'Unknown'.
RIGHT JOIN and FULL OUTER JOIN
-
RIGHT JOIN: Keeps all records from the
zones
table, even if there are no matching trips. - FULL OUTER JOIN: Keeps all records from both tables, filling missing values with NULLs.
4. Aggregation Using GROUP BY
Counting Trips Per Day
-
To group taxi trips by day and count them:
SELECT DATE_TRUNC('day', dropoff_time) AS trip_day, COUNT(*) AS trip_count FROM yellow_taxi_trips GROUP BY trip_day ORDER BY trip_day ASC;
-
Alternative approach using
CAST
:
SELECT CAST(dropoff_time AS DATE) AS trip_day, COUNT(*) AS trip_count FROM yellow_taxi_trips GROUP BY trip_day ORDER BY trip_day ASC;
Finding the Busiest Day
-
Sorting by count to find the day with the most trips:
SELECT CAST(dropoff_time AS DATE) AS trip_day, COUNT(*) AS trip_count FROM yellow_taxi_trips GROUP BY trip_day ORDER BY trip_count DESC LIMIT 1;
Example result:
January 28th
was the busiest day.
Finding Maximum Revenue Per Day
-
To check highest fare per trip per day:
SELECT CAST(dropoff_time AS DATE) AS trip_day, MAX(total_amount) AS max_fare FROM yellow_taxi_trips GROUP BY trip_day ORDER BY max_fare DESC;
Some drivers made over $1000 in a single trip.
Grouping by Multiple Columns
-
To group by both day and district:
SELECT CAST(dropoff_time AS DATE) AS trip_day, zd.borough, COUNT(*) AS trip_count FROM yellow_taxi_trips t JOIN zones zd ON t.do_location_id = zd.location_id GROUP BY trip_day, zd.borough ORDER BY trip_day ASC, zd.borough ASC;
-
Instead of listing each column, use positional reference:
GROUP BY 1, 2;
5. Summary and Key Takeaways
- SQL Joins are crucial for combining datasets and replacing foreign keys with meaningful data.
- INNER JOIN returns only matching records, while LEFT JOIN preserves unmatched records.
- GROUP BY is essential for aggregations and analytics, allowing operations like counting trips per day and finding peak revenue days.
- These SQL techniques will be essential in later topics, such as analytics engineering (dbt).
- The session also touched on data ingestion from CSV into Postgres, which is relevant for future lessons.
Top comments (0)