Welcome to the first blog post from this blog for 2025 (and also ever, but who's counting). Today I'm going to sink my teeth and my fangs into AWS Glue DataBrew. And as this is the first post (of many), we're going to start it off the right way: no explanation, just implementation. That's a lie; there will be an explanation, but after the implementation.
In the Glue DataBrew console, create a sample project
This sample project contains all the data we will explore. Let's pick up the dataset for Chess moves.
As always, create a new AWS IAM role with the appropriate permissions for the task. I said this blog works backwards, right? In reality, this is best practice, but you only do it after you have determined the scope with some nice almost-admin permissions. At least that's the shark's reality.
Start processing the data from the project console
This is a load of data: 17 columns and 2500 rows. Let's reduce some of this down and give it some quality. We're looking for the most common opening move for which black wins when the two players are within 22 rating points of each other (close games), according to this table:
To do this, we have to remove pretty much everything that's not those things.
Remove duplicates
Click the three dots on the column, and you'll find the option you need.
Apply the changes and continue.
Remove unnecessary columns
Look at the columns and remove everything that is not related to the ratings, the opening move, the winner and the ID. Your final columns should look like this:
Filter out games black did not win
Filter out the rows where black did not win using the filter icon:
Create difference column
Create a column to calculate the differences between ratings as follows:
Let's filter this column as we did before, with two filters this time: one keeping rating differences of -22 or more, and another keeping differences of 22 or less.
Not a lot of data, just 176 rows, but it helps us make our point. We can even find the frequency of opening moves right there in the console:
The opening that wins the most is A00 (Benko's opening), which makes sense since it is unconventional and not very advantageous for white:
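For readers who think in code, here is a rough pandas sketch of the same steps. It is not what DataBrew runs under the hood, just an equivalent under some assumptions: the column names (`id`, `white_rating`, `black_rating`, `opening_eco`, `winner`) are guesses at the sample dataset's schema, and the tiny `sample` frame is made-up data for illustration.

```python
import pandas as pd


def black_wins_close_games(games: pd.DataFrame, max_diff: int = 22) -> pd.Series:
    """Count openings in close games that black won.

    Column names are assumptions about the chess sample dataset.
    """
    # Step 1: remove duplicate rows.
    df = games.drop_duplicates()
    # Step 2: keep only the ID, the ratings, the opening and the winner.
    df = df[["id", "white_rating", "black_rating", "opening_eco", "winner"]]
    # Step 3: keep only the games black won.
    df = df[df["winner"] == "black"].copy()
    # Step 4: create the rating difference column.
    df["rating_diff"] = df["white_rating"] - df["black_rating"]
    # Step 5: two filters in one, keeping differences from -22 to 22 inclusive.
    df = df[df["rating_diff"].between(-max_diff, max_diff)]
    # Frequency of each opening, most common first.
    return df["opening_eco"].value_counts()


# Made-up rows, purely for illustration.
sample = pd.DataFrame(
    [
        ("g1", 1500, 1510, "A00", "black"),
        ("g1", 1500, 1510, "A00", "black"),  # exact duplicate, dropped
        ("g2", 1600, 1590, "A00", "black"),
        ("g3", 1400, 1405, "B01", "black"),
        ("g4", 1500, 1700, "C20", "black"),  # not a close game
        ("g5", 1500, 1505, "A00", "white"),  # black did not win
    ],
    columns=["id", "white_rating", "black_rating", "opening_eco", "winner"],
)

counts = black_wins_close_games(sample)
print(counts)
```

Each commented step lines up with one console action above, which is also roughly how the steps end up listed in the recipe.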
Introduction to Glue DataBrew
So, now we can introduce DataBrew based on what we saw up there. It is, in its simplest form, a way to clean (or brew) data and make it easier to consume. That is abundantly clear from the tutorial, but it's good to put it into words. Your final result is a series of data cleaning steps:
This series of steps is called a recipe, and you can use recipes to create jobs that apply the same steps to similar data.
There are, however, things we could do better, which I would encourage you to look at:
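If you would rather create that job from code than from the console, a minimal sketch with boto3 could look like the one below. Every name in it (job, dataset, recipe, role ARN, bucket, key prefix) is a hypothetical placeholder, and it assumes the recipe has already been published as version `1.0`. The request is built in a separate function so you can inspect it without touching AWS.

```python
def build_recipe_job_request(job_name, dataset_name, recipe_name, role_arn,
                             bucket, key_prefix, recipe_version="1.0"):
    """Build keyword arguments for DataBrew's create_recipe_job call."""
    return {
        "Name": job_name,
        "DatasetName": dataset_name,
        # Which recipe (and which published version) the job should follow.
        "RecipeReference": {"Name": recipe_name, "RecipeVersion": recipe_version},
        "RoleArn": role_arn,
        # Where the cleaned data lands in S3.
        "Outputs": [
            {"Location": {"Bucket": bucket, "Key": key_prefix}, "Format": "CSV"}
        ],
    }


def create_and_start_job(request):
    """Create the recipe job and start one run. Needs AWS credentials."""
    import boto3  # imported here so building a request works without boto3

    client = boto3.client("databrew")
    client.create_recipe_job(**request)
    run = client.start_job_run(Name=request["Name"])
    return run["RunId"]


# All values below are hypothetical placeholders.
request = build_recipe_job_request(
    job_name="chess-close-games-job",
    dataset_name="chess-games",
    recipe_name="chess-close-games-recipe",
    role_arn="arn:aws:iam::123456789012:role/example-databrew-role",
    bucket="example-output-bucket",
    key_prefix="databrew/chess/",
)
print(request)
```

Calling `create_and_start_job(request)` would then create the job and kick off a run, provided the dataset, recipe and role actually exist in your account.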
- Provision for potential missing values and filter them out.
- Widen the search to a larger ratings range.
- Adjust the range so that games where black was the far more experienced player (a far greater range) are reflected in the dataset.
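To make those three ideas concrete, here is one possible sketch in pandas rather than in DataBrew itself. The column names are assumptions about the sample dataset, and the `white_edge` and `black_edge` defaults are arbitrary numbers picked for illustration, not recommendations.

```python
import pandas as pd


def filter_games(games: pd.DataFrame, white_edge: int = 50,
                 black_edge: int = 200) -> pd.DataFrame:
    """Keep games black won, with a wider and lopsided rating window.

    Column names and default thresholds are illustrative assumptions.
    """
    needed = ["white_rating", "black_rating", "opening_eco", "winner"]
    # Idea 1: drop rows with missing values in the columns we rely on.
    df = games.dropna(subset=needed)
    df = df[df["winner"] == "black"].copy()
    df["rating_diff"] = df["white_rating"] - df["black_rating"]
    # Ideas 2 and 3: a wider window that allows black a far bigger rating lead
    # (negative differences) than white (positive differences).
    return df[df["rating_diff"].between(-black_edge, white_edge)]


# Made-up rows, purely for illustration.
games = pd.DataFrame(
    [
        ("g1", 1500, 1510, "A00", "black"),
        ("g2", None, 1500, "A00", "black"),  # missing rating, dropped
        ("g3", 1500, 1690, "B01", "black"),  # black far stronger, kept
        ("g4", 1600, 1500, "C20", "black"),  # white too far ahead, dropped
        ("g5", 1500, 1500, None, "black"),   # missing opening, dropped
        ("g6", 1500, 1500, "A00", "white"),  # black did not win
    ],
    columns=["id", "white_rating", "black_rating", "opening_eco", "winner"],
)

kept = filter_games(games)
print(kept[["id", "opening_eco", "rating_diff"]])
```

The lopsided window is just one reading of the last bullet. You could equally bucket games by how large black's lead is and compare the openings per bucket.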
But I digress. You can start with this and improve on it. And remember, I didn't become a SpiderShark just for the fun of it; I did it so I would have 10 appendages to type with, just like a human! ChompChomp!