Ah, Python. What a great language! I use it professionally for number crunching and related data tasks, as an easier-to-use (and often faster) alternative to VBA. To prove it, at least for my use cases, some stats are in order...
First, a dashboard that I used to run monthly until the data source vanished took 2.5 days to run by hand. It was slow and productivity-sucking, so I automated it in VBA. This reduced the run time to 14 hours - better, but still slow. So I rewrote things again in Python. The run time now? 10 minutes! Talk about time saved...
Second, another of my more popular reports analyzes usage of the various PDF publications we produce. Originally, the controller script for this was in VBA, and it was SLOW, taking almost 3 days to run! Like the last one, it was rewritten in Python. One part of the processing step is embarrassingly parallel: I have to process a bunch of independent rows of data the same way, making it a perfect task to divide up among CPU cores with the multiprocessing module. After additional tweaks and optimizations, the typical run time is just a smidge under 4 minutes; a fast path, where the results of some of the processing can be loaded from a saved file, takes as little as 13 seconds!
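Dividing that work up is mostly boilerplate-free. Here's a minimal sketch of the idea with `multiprocessing.Pool`; the per-row function below is a hypothetical stand-in for the real computation, not the report's actual code:

```python
from multiprocessing import Pool

def process_row(row):
    """Hypothetical stand-in for the real per-row computation."""
    return sum(x * x for x in row)

if __name__ == "__main__":
    rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # Each row is independent, so Pool.map can spread them across CPU cores.
    with Pool() as pool:
        results = pool.map(process_row, rows)
    print(results)  # [14, 77, 194]
```

`Pool.map` splits the input across worker processes and reassembles the results in order, so the calling code reads just like a serial `map` call.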
Finally, my most recent Python project is generating an HTML to-do list, so I can open it in a browser and see, at a glance, what I have on my plate. This had a short-lived VBA version too, but performance wasn't the driving factor behind the rewrite. Rather, it was VBA's lack of robust HTML and regular expression support. Later additions defined the structure of the data files in object form, which enables both strict error checking and MUCH better error reporting. One of the data files is a CSV that I update by hand, so it's easy to make mistakes like forgetting a comma or not quoting text with embedded commas. The script detects these problems and reports not only where it stopped parsing, but also what error it found. Examples of error messages:
- File 'data.csv', line 2, item 4: '2020-1-09' is not a valid ISO date!
- File 'data.csv', line 34: Found 21 items, expected 23!
- File 'data.csv', line 5, item 13: 'Highets' is not a valid priority!
This makes fixing the error easy, as I know exactly where to look. Contrast with the Python runtime's much less helpful errors:
- ValueError: Invalid isoformat string: '2020-1-09'
- ValueError: invalid literal for int() with base 10: 'Highets'
- No error.
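The friendlier messages come from validating each field as it is parsed and raising an error that carries the file name, line, and item position. A minimal sketch of that idea (the column layout, item count, and priority names here are assumptions for illustration, not the real file format):

```python
import csv
from datetime import date

PRIORITIES = {"Highest", "High", "Normal", "Low"}  # assumed priority names
EXPECTED_ITEMS = 3  # assumed column count for this sketch

def load_tasks(path):
    """Yield validated rows, raising a descriptive error on the first problem."""
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if len(row) != EXPECTED_ITEMS:
                raise ValueError(
                    f"File '{path}', line {line_no}: "
                    f"Found {len(row)} items, expected {EXPECTED_ITEMS}!")
            try:
                date.fromisoformat(row[0])  # rejects e.g. '2020-1-09'
            except ValueError:
                raise ValueError(
                    f"File '{path}', line {line_no}, item 1: "
                    f"'{row[0]}' is not a valid ISO date!") from None
            if row[1] not in PRIORITIES:
                raise ValueError(
                    f"File '{path}', line {line_no}, item 2: "
                    f"'{row[1]}' is not a valid priority!")
            yield row
```

The `from None` suppresses the chained traceback, so the user sees only the message that tells them exactly where to look in the file.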
As for future work, I am looking into scientific Python libraries to further accelerate the heavy-duty data lifting I do, in particular for the publication usage report, which has to ingest over 100 GB of raw data and merge it with data from a second source to produce the final result. To some extent these steps can run concurrently, but together they are still slow. Shame I can't speed up downloading the data files.
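For the merging step, pandas is a natural candidate among those libraries: it can stream a huge CSV in chunks and merge each chunk against the smaller second source, so the full 100 GB never has to fit in memory at once. A rough sketch with toy in-memory data (the column names and values are made up):

```python
import io
import pandas as pd

# In-memory stand-ins for the real inputs (the actual raw data is >100 GB).
usage_csv = io.StringIO("pub_id,views\n1,10\n2,5\n1,7\n2,3\n1,1\n")
lookup = pd.DataFrame({"pub_id": [1, 2], "title": ["Guide", "Manual"]})

totals = []
# chunksize makes read_csv yield DataFrames piece by piece instead of
# loading everything; a real run would use a chunksize in the millions.
for chunk in pd.read_csv(usage_csv, chunksize=2):
    merged = chunk.merge(lookup, on="pub_id", how="left")
    totals.append(merged.groupby("title")["views"].sum())

# Combine the per-chunk partial sums into the final per-title totals.
result = pd.concat(totals).groupby(level=0).sum()
print(result)
```

Because each chunk's partial sums are tiny compared to the raw data, only the final combine step touches them all at once.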
Oh, and a shout-out to @ronsoak for inspiring me to write this!
Top comments (2)
I love the context about how you used to use VBA. Most articles don't talk about what they are replacing and the time gains you have achieved, 14 hours to a few minutes. Amazing !!!!
Great article keep it up!
Very intriguing and cool!