GPUs have long been used to train neural networks and run inference on them, but that hasn't been the case when working with pandas on large datasets. pandas does all of its processing on the CPU, which is fine for everyday workloads, but on a large dataset the processing time can stretch to several hours or even days.
That changed recently when RAPIDS AI launched cuDF, a "Python GPU DataFrame library" built for manipulating data. cuDF.pandas, which is built on top of cuDF, accelerates pandas by running operations on the GPU and falling back to the CPU for unsupported functions.
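Enabling the accelerator is a one-line change. A minimal sketch is below; it wraps the activation in a try/except so the same script still runs on plain CPU pandas when cuDF (and a supported NVIDIA GPU) is not available:

```python
# Sketch: opt into GPU acceleration with cudf.pandas, falling back to
# stock pandas when cuDF is not installed.
try:
    import cudf.pandas
    cudf.pandas.install()  # later imports of pandas become GPU-backed
except ImportError:
    pass                   # no cuDF available: plain CPU pandas is used

import pandas as pd  # must come AFTER cudf.pandas.install()

df = pd.DataFrame({"x": [1, 2, 3]})
total = int(df["x"].sum())  # runs on GPU when cuDF is active
```

In a Jupyter or Colab notebook you can instead load it as an extension with `%load_ext cudf.pandas` before importing pandas.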
As a research assistant and master's student, I was cleaning a large dataset and using the pandas ffill and bfill functions to forward-fill and backward-fill missing values in a column when I felt the need for a faster implementation.
Although ffill and bfill are already vectorized in pandas, cuDF speeds them up significantly thanks to parallel processing on the GPU. In my case, the time required to run each of these functions dropped by roughly 80x (on an NVIDIA Tesla T4 GPU in Google Colab).
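For reference, here is what those fill operations look like. The snippet below uses plain pandas, so it runs anywhere; under cuDF.pandas the identical code is executed on the GPU:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0, np.nan])

forward = s.ffill()        # propagate the last valid value downward
backward = s.bfill()       # pull the next valid value upward
both = s.ffill().bfill()   # chain both to also cover leading gaps
```

After `ffill`, the leading NaN remains (there is nothing before it to copy), which is why chaining `bfill` is a common pattern for filling a column completely.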
As mentioned above, cuDF doesn't accelerate every pandas function, especially complex user-defined functions (UDFs). To speed those up, we can instead leverage parallel processing on the CPU.
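One way to parallelize a UDF on the CPU is to split the DataFrame into chunks and process them concurrently. A minimal sketch, with an illustrative `clean_chunk` UDF and column name of my own invention (a thread pool is used here so the snippet runs anywhere; for CPU-bound pure-Python UDFs, a `ProcessPoolExecutor` with a top-level function is the better fit):

```python
# Sketch: fan a UDF out over DataFrame chunks across workers.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for a complex UDF that cuDF cannot accelerate.
    out = chunk.copy()
    out["value"] = out["value"].map(lambda v: round(v ** 0.5, 3))
    return out

df = pd.DataFrame({"value": np.arange(1.0, 1001.0)})

# Split into one chunk per worker using ceil division.
n_workers = 4
size = -(-len(df) // n_workers)
chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]

with ThreadPoolExecutor(max_workers=n_workers) as ex:
    result = pd.concat(ex.map(clean_chunk, chunks), ignore_index=True)
```

Because the chunks are independent, the workers never contend over shared state, and concatenating the results reproduces the original row order.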
To set up cuDF, follow the RAPIDS Installation Guide. For setup on Google Colab, check out this GitHub repo: cuDF-Setup.
Have you tried using cuDF to speed up your data processing tasks? Share your experiences and insights in the comments below, or reach out with any questions you might have about integrating cuDF into your workflows.