Background
After my last visualisation of Harry Potter data, I decided to use some other data to create word clouds. I am a huge Skyrim fan and always wanted to learn how to use xEdit Scripts. As a result, here I am with another Word Cloud.
Disclaimer: A lot of the code is shared between my previous project and this one
Visualisations
Getting the data
I don't like opening my Windows installation (I have a dual boot setup, and use Manjaro mainly), and looked around the internet for some sort of data dump of Skyrim Dialogues. Unfortunately, I couldn't find any and then decided to extract the data myself. I had recently formatted my Windows Partition so had to reinstall the game. It also provided the benefit that no mods would pollute the data. (I had over 150 mods before the format). I downloaded the latest xEdit and used the Export dialogues.pas
script that comes with it to export all the dialogues. (It took me 22:05 minutes).
I am going to look into other data I can extract this way, and maybe make some other stuff.
Processing the data
In the CSV, there were two columns of data I was interested in RESPONSE TEXT & TOPIC TEXT. Response Text was the larger one, with over 40k unique dialogues. Topic Text had only around 5.5K unique dialogues and also needed some additional processing. Topic Text contained some game constants such as RoomCost
, HorseCost
, and other prices, which had to be filtered out. I did all that in csv_to_json.py
.
Counting the Words
Like the previous visualisation, I used nltk's stopwords corpus, along with a modified version of 20k most common words by Google. Interestingly, the modifications I did for Harry Potter were valid for Skyrim as well, because there is no dialogue with names like Harry, Ron, Arthur, etc and they share words like vampires, magic, etc.
I counted both RESPONSE TEXT & TOPIC TEXT data separately and then merged them into a single file count.json
Additional Tip: progress is a great Python Package to show progress in your scripts.
Making the WordCloud
I used pretty much the same process as the last visualisation. I changed the maximum font size to depict the variation properly and used a custom font this time.
To make the WordCloud, I used the wordcloud package. For the mask, I used Skyrim Logo Vector. For the font, I used Sovngarde font.
Making the Graph
I initially planned on making a set of graphs from the data, but wasn't able to due to two reasons:
- Some of the data was weird. Argneir had the highest dialogue count due to dialogues of many NPCs (including General Tullius, I think) is assigned to him.
- Some of the data doesn't; produce interesting visualisations. Nords have the highest dialogue count, and after the difference between the first few races and the remaining is so huge that a lot of the races aren't visible.
Since I had already made this, I thought of sharing it here, in case someone is interested in the image or its code.
Future Plans
I will look into making custom scripts (if someone already has them, do share it with me) to extract other interesting data from Skyrim and see what I can do with them.
Top comments (0)