P.S. This turned out to be a rant...just so you know...
Motivation
I am currently involved in a small web scraping project. My job is to retrieve information about local charity organizations from a website that only offers a web-based search interface. The complexity of the project is manageable, but I learned a thing or two in the process of writing a runnable Python script that captures the data and transforms it into a readable Excel spreadsheet.
1. Web Scraping Can Be Fun
Within the bounds of laws and regulations, gathering information from the web programmatically is what I would broadly call web scraping. I believe it is a familiar concept to many, thanks in part to the popularity of Python and how easy simple scraping tasks are in it. After working on web development projects for a while, I gained a better understanding of how websites work behind the scenes, and in the most recent web scraping project I worked on, I was able to use those insights to explore different ways of gathering the required data.
There was a working Python script written by someone else for the above-mentioned project when I took over. It used basic Selenium selectors to crawl the information field by field. The issue with this approach is that the Chrome web driver has to be kept in sync with the user's Chrome browser. The script would often fail to run after a month or two, and it was annoying to have to download the driver again every time. The problem is made worse when the script has to be run by someone else.
Change is the only constant. I got the chance to update the script because the website it was scraping had a major upgrade and the logic of the selectors no longer worked. With the web development experience I have now, I decided to do some preliminary checks and see if I could find an easier way to get the data.
The first thing that came to my mind was to check the network calls the site was making and see if I could make them directly. Cutting out the middleman is always a strategy worth trying. With the inspector tool open, I was able to observe the requests and responses as the website refreshed.
Whipping out an API testing tool, I was able to replicate the network calls directly. In fact, I could now gather the entire list of organization information as JSON. For the details of an individual organization, I had to find the corresponding query string that the site used to identify it. This was interesting, as the query string looked like this: M2E5M2Q1N2YtNzk2NS1lMzExLTgyZGItMDA1MDU2YjMwNDg0.
I was pretty clueless at first, but as someone who has been through the entire journey of front-end and back-end development, I know that people don't write perfect software and there are always clues hidden in the source code. Given that we can easily inspect a website's HTML, I decided to look for hidden treasures in the HTML file. After some inspection, I found the piece of code used to build the query string: btoa(charityID). After googling btoa, I found out that it encodes a string into base64. With that, I was able to simplify the web scraping process by encoding the string programmatically and using the requests package to make POST requests for what I wanted.
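For reference, JavaScript's btoa() maps directly onto Python's standard base64 module, and decoding the sample query string above shows there is nothing magical hidden inside it. A minimal sketch (the surrounding request code is omitted here, since the endpoint and payload depend on the site):

```python
import base64

# The query string the site uses is just a base64-encoded charity ID.
query_string = "M2E5M2Q1N2YtNzk2NS1lMzExLTgyZGItMDA1MDU2YjMwNDg0"

# Decoding reveals the underlying identifier -- a plain GUID.
charity_id = base64.b64decode(query_string).decode("ascii")
print(charity_id)  # 3a93d57f-7965-e311-82db-005056b30484

# Encoding works the other way: this is Python's equivalent of btoa().
encoded = base64.b64encode(charity_id.encode("ascii")).decode("ascii")
assert encoded == query_string
```

With the encoded ID in hand, fetching a detail page becomes something like requests.post(api_url, data={"id": encoded}) — the exact endpoint URL and payload shape have to be copied from the network calls observed in the inspector.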
2. It's not a bug but a feature?
I thought the above experience was interesting, but the following point is what triggered me to write this article. After I finished the script, I was informed that the resulting files had a few issues. Looking at the code again, I realized I had made a mistake.
To understand the problem, let me briefly introduce the background. The information organized in JSON format contains primary categories and sub-categories. Thus, one combination could be
- Primary category: Personal
- Sub-category: Expenditure
- Primary category: Business
- Sub-category: Expenditure
In the example given here, it is clear that both categories contain a sub-category called "Expenditure". This does not seem like a problem unless the JSON format is something like the following:
[
  { "key": "someOtherValue", "value": 123 },
  { "key": "expenditure", "value": 123 },
  { "key": "expenditure ", "value": 456 },
  { "key": "someMoreValue", "value": 123 }
]
It is simply an array of key-value pairs. So, how do the developers who created this schema tell whether an expenditure amount belongs to the "personal" or the "business" category?
Initially, I was unaware that the same identifier was being used twice. What I did notice was that some identifiers had a trailing space. I thought they were careless mistakes and added some code to strip trailing spaces while processing the data. Later, I found out that the trailing spaces were intentional: that was how they differentiated one value from the other. The best part is that because trailing spaces are practically invisible, the developers simply loop through the values in the array and display them as a table on the website. When I inspected the HTML, there were indeed trailing spaces in some of the identifiers. I was rather speechless to find out that a trailing space was used as part of a unique identifier. This is worse than having a bad name...
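The bug is easy to reproduce. Using the field names from the example above, stripping whitespace collapses the two distinct keys into one, so one of the values is silently lost:

```python
# Simplified version of the site's payload: the second "expenditure" key
# carries a trailing space, which is the only thing distinguishing it.
fields = [
    {"key": "someOtherValue", "value": 123},
    {"key": "expenditure", "value": 123},
    {"key": "expenditure ", "value": 456},  # note the trailing space
    {"key": "someMoreValue", "value": 123},
]

# Naively stripping the keys merges the two categories: the later value
# overwrites the earlier one in the resulting dict.
stripped = {item["key"].strip(): item["value"] for item in fields}
assert len(stripped) == 3  # one "expenditure" entry is silently lost

# Keeping the keys verbatim preserves both entries.
verbatim = {item["key"]: item["value"] for item in fields}
assert len(verbatim) == 4
assert verbatim["expenditure"] == 123
assert verbatim["expenditure "] == 456
```

The fix in my case was simply to stop stripping the keys and map each verbatim key to its category explicitly.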
Conclusion
We all tend to take the shortest, most efficient path to make something work. This could mean copy-pasting code and making the slightest change to satisfy a new requirement. If the software is important and used by many, we ought to stop in our tracks sometimes and plan proper refactoring to make it right. Or else...