In the last installment, we had a simple 10-line script (excluding imports) that pulled, or extracted, the share price and u3o8 stock values from a website. It then cleaned up, or transformed, those values into numbers and printed them to the screen. You'll notice, though, that even in those 10 lines of code there was some repetition.
fund_values = soup.find_all('div', class_='fundHeader_value')
shareprice = fund_values[4].contents
shareprice_value = str(shareprice[0]).strip().replace('$US', '')
u3o8 = fund_values[6].contents
u3o8_stock = str(u3o8[0]).strip().replace(',', '')
Share price and u3o8 stock are extracted in essentially the same way: reach into the list of 'div' elements with the class 'fundHeader_value' at an index, then clean up the string. This violates a general rule of good coding practice called "DRY," or "Don't Repeat Yourself." Early in my coding days, my mentor used to tell me: if you have to do it twice, create a function. Creating these functions is the first step toward abstraction and more robust, scalable code!
How can we break this code up into functions? In any data processing, I like to think through the common lens of extraction, transformation, and loading, or ETL. Nowadays it's also common to find ELT, especially in streaming pipelines, so don't get caught up in the order of those letters as long as you're extracting first.
What's our Extract? It's pulling the webpage down into BeautifulSoup.
What's our Transform? It's processing the string and producing a clean number.
What's our Load? It's just printing the result to the screen.
Let's turn these into functions:
Extract
def web_call(url):  # Extract
    r = requests.get(url)
    if r.status_code == 200:
        soup: BeautifulSoup = BeautifulSoup(r.content, "html.parser")
        return soup
    else:
        return r.status_code
Now we have a generic function that you can point at any URL, not just the Sprott URL, and do in one line of code what previously took several.
Here is how you'd do it for Sprott:
soup = web_call(
    url='https://sprott.com/investment-strategies/physical-commodity-funds/uranium/')
That's it!
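One thing to keep in mind: as written, web_call returns a BeautifulSoup object on success but the raw status code on failure. Here's a minimal sketch of how a caller could guard against that (the isinstance check is my addition, not part of the original script):

if not isinstance(soup, BeautifulSoup):
    # web_call hands back an int status code when the request fails,
    # so stop before calling BeautifulSoup methods on a plain number.
    raise SystemExit(f'Request failed with status code {soup}')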
Transform
def get_fund_values(soup, index, class_name):  # Transform
    fund_values = soup.find_all('div', class_=class_name)
    value = fund_values[index].contents
    return str(value[0]).strip().replace('$US', '').replace(',', '')
This one is only slightly abstracted (not as generic), but instead of repeating the find_all call, indexing into the contents, and cleaning the string n times, you can now do each of those lookups with a single line of code.
Here is how you'd do it now for share price:
shareprice_value = get_fund_values(soup, 4, 'fundHeader_value')
We have 2 values we're grabbing but there are more on the website. We could grab each value with a single line of code.
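For example, you could grab several values at once by looping over a small mapping of names to indices. A quick sketch, assuming those indices exist on the page (only 4 and 6 come from this article; any others would be hypothetical):

# Map output names to their position in the 'fundHeader_value' divs.
fields = {'shareprice': 4, 'u3o8_stock': 6}

values = {name: get_fund_values(soup, index, 'fundHeader_value')
          for name, index in fields.items()}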
Load is just a call to Python's built-in print() at the moment, but let's make it more useful. Why don't we write the two values we grab to a JSON file? Here is a simple function that will do that:
Load
def write_json(data, filename='data.json'):  # Load
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)
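As a quick standalone illustration (the values and filename below are placeholders, not real fund data), a call like

write_json({'shareprice': '17.50', 'u3o8_stock': '63000000'}, filename='example.json')

would produce an example.json file with those two keys, pretty-printed thanks to indent=4. (You'll need import json at the top of the script for this to work.)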
Now, because everything lives in functions, the script won't do anything just by running top to bottom; we have to tell Python what to do when the script is run directly:
if __name__ == '__main__':
    soup = web_call(
        url='https://sprott.com/investment-strategies/physical-commodity-funds/uranium/')
    data = {}
    data['shareprice'] = get_fund_values(soup, 4, 'fundHeader_value')
    data['u3o8_stock'] = get_fund_values(soup, 6, 'fundHeader_value')
    write_json(data=data)
This will call the website, create a dictionary called data, fill the dictionary with the 2 values, and write the dictionary to a JSON file.
What's really cool about these functions is that they're no longer just usable in this script. We can import these functions into other Python scripts and create similar scrapers!
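For instance, a new script could reuse all three functions with a plain import (the file name, URL, and index below are hypothetical, assuming the functions live in functional_scraper.py on your import path):

# another_scraper.py -- reusing the same Extract/Transform/Load functions
from functional_scraper import web_call, get_fund_values, write_json

soup = web_call(url='https://example.com/some-other-fund')  # placeholder URL
data = {'some_value': get_fund_values(soup, 0, 'fundHeader_value')}  # hypothetical index
write_json(data, filename='other_fund.json')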
Code is below and on GitHub [here](https://github.com/CincyBC/bootstrap-to-airflow/blob/main/src/functional_scraper.py). Join me next time for a quick chat about unit tests, and then it's off to the land of Object-Oriented Programming (OOP) to see how we can turn this into a Scraper class that will form the baseline for our Airflow DAG.