Ever since I came across DuckDB, I have been fascinated by its capabilities. For those who are not aware of it, DuckDB is a super fast in-process OLAP database. I started using it for ad-hoc analysis and noticed that it is blazingly fast, but I had only used it on small datasets. In this blog post, I experiment with DuckDB on medium-sized datasets and see how it handles over 100 million rows on my local machine. Note that all the testing was done on my local laptop, a 4-core, 16 GB machine running Windows.
Datasets
I used two large datasets that I found on Kaggle and a medium-sized proprietary dataset. Here are the details about the datasets:
A proprietary dataset that has half a million rows and is hosted on a MySQL server. Unfortunately, I won't be able to share details about this dataset, but below I have shared how DuckDB performed on it.
Anime dataset from Kaggle. This dataset contains the user ratings given to various anime. The dataset has three files. You can find the details below.
Binance 1-second bitcoin dataset from Kaggle. This dataset contains bitcoin prices at a one-second granularity between 2017-08-17 and 2021-02-23. There is a part 2 to this data, which you can find here. I have only used part 1 for now.
| Dataset | File name | File type | Row count | File size |
|---|---|---|---|---|
| Anime dataset | user_details | csv | 700k | 73 MB |
| Anime dataset | anime_details | csv | 24k | 15 MB |
| Anime dataset | user_ratings | csv | 24 million | 1.1 GB |
| Binance 1-second bitcoin dataset | half1_BTCUSDT_1s | csv | 110 million | 12.68 GB |
Table: Anime & Binance 1-second bitcoin datasets - files and details
Results - Quick Summary
Here is a summary of the results, in the order of the datasets listed in the previous section:
Proprietary dataset - DuckDB processed the query over half a million rows in less than a second. Including the time it took to read the data from MySQL, the total came to around 55 seconds. MySQL took around 6 minutes to complete the same query! The query involved calculating active users over the last 6 months on a monthly rolling basis (a rough sketch of this kind of query is shown after this summary).
Anime dataset - DuckDB breezed through the anime dataset. I tried multiple things, which I will discuss in detail later. For now: it was able to get the top-rated anime by calculating the average user rating across 24 million rows in 18 seconds. There is a surprise query that made DuckDB sweat for 4.5 minutes, which will be discussed later. Pandas broke down on this last query with a memory allocation error.
Bitcoin dataset - On the bitcoin dataset, I calculated the basic 50-second and 200-second moving averages. DuckDB processed this in about 55 seconds on over 100 million rows, all the while keeping the laptop's overall memory usage under 70% (including the other processes that were using the system's memory).
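Since I can't share the proprietary data, here is a rough, hypothetical sketch of the kind of rolling active-user query I ran. The `events` table and its `user_id`/`event_date` columns are assumptions for illustration; the real table and column names are different.

```python
import duckdb

con = duckdb.connect()

# Hypothetical schema: an 'events' table with user_id and event_date columns,
# already pulled from MySQL into DuckDB. For each month, count the distinct
# users who were active in that month or any of the previous 5 months.
rolling_active_users = con.sql("""
    WITH months AS (
        SELECT DISTINCT date_trunc('month', event_date) AS month_start
        FROM events
    )
    SELECT
        m.month_start,
        COUNT(DISTINCT e.user_id) AS active_users_6m
    FROM months m
    JOIN events e
      ON e.event_date >= m.month_start - INTERVAL 5 MONTH
     AND e.event_date <  m.month_start + INTERVAL 1 MONTH
    GROUP BY m.month_start
    ORDER BY m.month_start
""")
rolling_active_users.show()
```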
Experimenting with DuckDB - Anime dataset
As already explained, the anime dataset contains three files. I used both DuckDB and Pandas for the analysis, so we can understand how DuckDB compares to Pandas. We compare the following between DuckDB and Pandas:
- Load times of datasets between DuckDB and Pandas
- Since we have three datasets, we join the datasets to get the user details, anime details and ratings in a single place.
- Next, we calculate the average user rating for each anime.
- Finally, we calculate the genre-level ratings.
1. Loading Datasets
Let us load the three files using both DuckDB and Pandas. Note that DuckDB's load times are a few milliseconds, while Pandas takes a couple of seconds to load the data. This is because DuckDB constructs what are called relations. These relations are symbolic representations of SQL queries; they do not hold any data, and nothing is executed until a method that triggers execution is called. This explains why the DuckDB loads look instantaneous.
Comparing csv file load times of DuckDB vs Pandas.
In the above screenshot, we see CPU times and Wall time. These are explained briefly below.
- CPU times - Measures the actual time the CPU spends working on a specific task. This only includes time when the CPU is actively executing the process, so it excludes waiting periods like I/O operations. It’s often used to gauge how much CPU power a task consumes.
- Wall time - Also known as Elapsed Time, measures the real-world time from start to finish of a task. This includes all pauses and waiting periods, such as waiting for data from a disk or network, making it a full picture of the user’s wait time.
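For reference, the load step looks roughly like the sketch below. The file names are placeholders based on the table above, not the exact Kaggle file names.

```python
import duckdb
import pandas as pd

# DuckDB: read_csv only builds a relation (a symbolic representation of the
# query), so it returns in milliseconds - no data is actually scanned yet.
user_ratings_rel = duckdb.read_csv("user_ratings.csv")   # placeholder file name

# Pandas: read_csv eagerly parses the whole file into memory, which is why
# the 1.1 GB ratings file takes a couple of seconds (or more) to load.
user_ratings_df = pd.read_csv("user_ratings.csv")
```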
2. Joining Datasets
Next, we join the three tables: anime_details is joined with the user_ratings and user_details tables. We also convert the genres column to a list data type so it can later be used for ratings calculations across genres. Finally, we select only the required columns. While I was able to achieve this with a single SQL query in DuckDB, it took multiple steps to achieve the same in Pandas. Below is a screenshot comparing the two. From the screenshot, we can see that DuckDB has still not executed the query, as the reported processing time is 0 nanoseconds. Pandas processed everything in about 1 minute 36 seconds.
Joining the three tables in the anime dataset
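The DuckDB side of the join looks roughly like the sketch below. The column names (anime_id, user_id, name, genres, rating) are my assumptions about the Kaggle files, not verified against them.

```python
import duckdb

# Join the three files and convert the comma-separated genres string to a list.
# This only builds the relation; nothing is executed yet.
joined = duckdb.sql("""
    SELECT
        user_id,
        anime_id,
        a.name                        AS anime_name,
        string_split(a.genres, ', ')  AS genres,   -- string -> list of genres
        r.rating
    FROM read_csv_auto('user_ratings.csv')  r
    JOIN read_csv_auto('anime_details.csv') a USING (anime_id)
    JOIN read_csv_auto('user_details.csv')  u USING (user_id)
""")
```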
3. Best anime by user ratings
Here, we compute the average rating for each anime and sort the results in descending order of rating. We only consider anime that received at least 1,000 user ratings. We can see DuckDB is winning this round too. Given that it is doing lazy operations, my understanding is that in this step DuckDB loaded the required data, performed the joins on the three tables, computed the average anime ratings and sorted them, all in around 18 seconds. Pandas took 23 seconds just for the average computation and sorting, given that it had already loaded the data and performed the joins.
Computing the average user ratings of anime sorted in order of user rating
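A rough equivalent of the queries behind the screenshot, continuing from the join sketch above (run that snippet first; column names are still assumptions):

```python
import duckdb

# DuckDB: the 'joined' relation from the previous sketch is referenced by its
# Python variable name. Calling show() is what finally triggers execution:
# the CSV scans, the joins and this aggregation all run here.
top_anime = duckdb.sql("""
    SELECT
        anime_name,
        AVG(rating) AS avg_rating,
        COUNT(*)    AS num_ratings
    FROM joined
    GROUP BY anime_name
    HAVING COUNT(*) >= 1000
    ORDER BY avg_rating DESC
""")
top_anime.show()

# Pandas: a rough equivalent on an already-joined DataFrame (eagerly computed).
# stats = joined_df.groupby("anime_name")["rating"].agg(["mean", "count"])
# top_anime_pd = stats[stats["count"] >= 1000].sort_values("mean", ascending=False)
```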
In the previous step, notice that CPU time is greater than Wall time for DuckDB. This means the process is running in a multi-threaded or parallel manner, where multiple CPU cores are working on the task simultaneously. Each core's time adds up, resulting in a higher total CPU time (sum of time across all cores) compared to the actual elapsed wall time.
For example, if a process takes 2 seconds of Wall Time to complete but uses four CPU cores simultaneously, each for 2 seconds, the CPU Time would be 2 seconds * 4 cores = 8 seconds.
4. Best anime genre by user rating
The final query computes the average ratings for the different genres. But each anime can belong to multiple genres, which is why we created the genres column as a list in the previous step. We unnest this genres column in DuckDB so the list of genres is split into multiple rows, one for each genre. Similarly, we use explode in Pandas to expand the list into multiple rows. On average, each anime belongs to about three genres, so the final data can expand to over 72 million rows. Then the average rating is computed for every genre by taking the average of the user ratings. While we can argue against this logic, the idea was to push DuckDB and see how it handles such an explosion of data. This is the step that took DuckDB 4.5 minutes to compute. Pandas gave up at this point with an "Unable to allocate memory" error. Below is the query.
Calculating the average user ratings across anime genres. Notice how pandas is unable to process this because of memory issues.
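Here is a minimal sketch of the genre-level query, again continuing from the joined relation and with assumed column names. The Pandas version is shown as comments, since it is the part that failed with the memory error.

```python
import duckdb

# UNNEST expands the genres list into one row per genre (~72 million rows),
# then we average the user ratings per genre. 'joined' is the relation from
# the earlier join sketch.
genre_ratings = duckdb.sql("""
    SELECT genre, AVG(rating) AS avg_rating
    FROM (
        SELECT UNNEST(genres) AS genre, rating
        FROM joined
    )
    GROUP BY genre
    ORDER BY avg_rating DESC
""")
genre_ratings.show()

# Pandas equivalent on the eagerly joined DataFrame - explode() materialises the
# same ~72 million row expansion in memory, which is where it ran out of RAM.
# exploded = joined_df.explode("genres")
# genre_ratings_pd = exploded.groupby("genres")["rating"].mean().sort_values(ascending=False)
```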
In conclusion, we see that DuckDB evaluates lazily, delaying the loading and joining of the datasets until the average calculation forces execution. We also saw that DuckDB handled the explosion of the dataset very well, without any memory issues.
Experimenting with DuckDB - Bitcoin dataset
Next, we use the bitcoin dataset extracted from Binance. The dataset has about 110 million rows. I tried using Pandas, but it wasn't even able to load the whole dataset into memory, so I had to abandon Pandas for this dataset and focus on DuckDB alone.
1. Loading the dataset
First, we load the csv file, which is about 12 GB. As discussed for the previous dataset, DuckDB loads it lazily, hence the loading seems instantaneous. In the next step, we print the dataset, and here too the output appears instantaneously. This is because DuckDB doesn't read the dataset fully yet; it only scans the first 10,000 rows to show the output. This is why even the show() method is very quick.
Loading the Bitcoin dataset
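In code, the load and preview look something like this (file name taken from the table at the top; column layout assumed):

```python
import duckdb

# Lazy "load" of the ~12 GB csv: only a relation is built, so this is instant.
btc = duckdb.read_csv("half1_BTCUSDT_1s.csv")

# show() fetches just a small preview of rows to render the output, so it is
# also near-instant - the 110 million rows are never fully read here.
btc.show()
```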
2. Computing the moving averages
Next, we compute moving averages on the dataset. Moving averages are one of the basic indicators in trading, used to decide when to buy and sell an asset. Here, we compute the 50-second and 200-second moving averages over the entire dataset. As we can see, the result is returned in under a minute.
Computing the 50 seconds and 200 seconds moving averages for the entire dataset
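A sketch of the moving-average query using window functions. Since the rows are at one-second intervals, a 50-row window corresponds to 50 seconds. The column names ("Open time", "Close") are assumptions about the Kaggle file.

```python
import duckdb

btc = duckdb.read_csv("half1_BTCUSDT_1s.csv")

# 50-second and 200-second moving averages of the close price.
# The relation 'btc' is referenced by its Python variable name inside the SQL.
moving_averages = duckdb.sql("""
    SELECT
        "Open time",
        "Close",
        AVG("Close") OVER (ORDER BY "Open time"
                           ROWS BETWEEN 49 PRECEDING AND CURRENT ROW)  AS ma_50s,
        AVG("Close") OVER (ORDER BY "Open time"
                           ROWS BETWEEN 199 PRECEDING AND CURRENT ROW) AS ma_200s
    FROM btc
    ORDER BY "Open time"
""")
moving_averages.show()   # this call is what actually triggers the computation
```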
3. Lazy evaluation on moving averages variable
Next, I tried calculating the moving averages and assigned the result to a moving_averages variable rather than directly showing the output. As expected, DuckDB has not evaluated the query at this point; notice that it shows 0 nanoseconds of processing time in step 1 of the below screenshot. Next, we run two simple calculations on this variable:
- First, we calculate the maximum Open time, the minimum Open time and the total row count of the moving_averages variable.
- Then we calculate the same maximum and minimum Open time and total row count again, but this time moving_averages is filtered to dates before 2017-08-08. This filtered dataset has only about 79k rows.
The surprising outcome is that both queries took about the same time, and each took almost double the time of the moving averages calculation from the previous step. In the second calculation, the filtered dataset has only around 79k records to process, which is only a fraction of the 110 million records, yet it still took the same time. It is unclear to me how DuckDB plans the query execution in these two scenarios in the backend, and the result came as a surprise. Do comment if you know how this works!
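For reference, the two follow-up queries were expressed roughly as below, using DuckDB's relational API on the lazily built moving_averages relation (column names assumed as before):

```python
import duckdb

btc = duckdb.read_csv("half1_BTCUSDT_1s.csv")

# Assigning the windowed query to a variable only builds the plan (0 ns reported).
moving_averages = duckdb.sql("""
    SELECT
        "Open time",
        AVG("Close") OVER (ORDER BY "Open time"
                           ROWS BETWEEN 49 PRECEDING AND CURRENT ROW)  AS ma_50s,
        AVG("Close") OVER (ORDER BY "Open time"
                           ROWS BETWEEN 199 PRECEDING AND CURRENT ROW) AS ma_200s
    FROM btc
""")

aggs = 'max("Open time"), min("Open time"), count(*)'

# Query 1: aggregates over the full 110-million-row relation.
moving_averages.aggregate(aggs).show()

# Query 2: the same aggregates after filtering to dates before 2017-08-08
# (~79k rows), assuming "Open time" parses as a timestamp. Surprisingly,
# this took about the same time as the unfiltered query.
moving_averages.filter(""" "Open time" < DATE '2017-08-08' """).aggregate(aggs).show()
```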
In conclusion, we see that DuckDB is able to handle 110 million rows without any issues on a local laptop. This shows how efficiently DuckDB uses the available resources.
Experimenting with DuckDB - Observations
- DuckDB delays the execution of queries until a method that triggers execution is called.
- With its columnar in-memory processing engine, it is much faster than Pandas.
- It processes data extremely memory-efficiently. While the datasets were big enough to challenge the RAM on my system, RAM usage always stayed under 70%, even with other processes running on the system.
- The above point means DuckDB can handle larger-than-memory datasets.
- Notice that in DuckDB, CPU time is always greater than Wall time. This shows that DuckDB engages multiple cores/threads to process the data; the default number of threads DuckDB used on my machine was 8. Notice that this is not the case with Pandas.
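If you want to check or change the thread count yourself, DuckDB exposes it as a setting:

```python
import duckdb

con = duckdb.connect()

# Inspect the current number of threads DuckDB will use...
print(con.sql("SELECT current_setting('threads')").fetchone())

# ...and change it if needed (e.g. to limit CPU usage on a shared machine).
con.sql("SET threads TO 4")
```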
Experimenting with DuckDB - Conclusion
We can conclude that DuckDB is capable of handling datasets of hundreds of millions of rows with ease, even on a local machine. For professionals like me who love SQL, DuckDB gives the ability to work with large datasets with minimal infrastructure, using a familiar SQL dialect. Whether you're performing complex analytical queries or just need fast results on a budget, DuckDB offers a balance of speed and flexibility that makes it a strong choice in large-scale scenarios.
You can find all the code used in this blog in my GitHub repository here. I hope to do another part on experimenting with DuckDB where I use even larger datasets.
Hope you enjoyed the read! Do share your valuable feedback and comments.