I’m always on the lookout for quick hacks and code snippets that might improve efficiency. Most of the time I find them on Stack Overflow, but here’s one dealing with parallelization that I thought would be worth sharing.
Since pandas doesn’t have a built-in parallelism feature yet, running apply over a huge dataset is a pain when the function being applied is computationally expensive. One way to cut that time down is to split the dataframe into separate pieces, perform the apply function on each piece in parallel, and then re-concatenate the pandas dataframes.
Let’s take an example pandas dataframe.
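Something like this will do; the column names and size here are just placeholders:

```python
import numpy as np
import pandas as pd

# A toy dataframe: a million rows of random numbers.
df = pd.DataFrame({'a': np.random.rand(1000000),
                   'b': np.random.rand(1000000)})
```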
I’m going to use the multiprocessing package in Python and import Pool. Pool helps spin up new worker processes on the machine (processes rather than threads, which is what lets the work get around the GIL).
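The setup might look like this (the names num_cores and num_partitions are my own; pick whatever values suit your machine):

```python
from multiprocessing import Pool, cpu_count

num_cores = cpu_count()     # number of worker processes to spin up
num_partitions = num_cores  # number of pieces to split the dataframe into
```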
Here df and func are the dataframe and the function being applied to it, respectively. First, split the dataframe into a set of partitions, as shown below. Note that if you specify a number of partitions greater than the number of rows in the dataset, the function will throw an error.
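One way to do the split is NumPy’s array_split, which chops a dataframe into roughly equal pieces (using it here is my assumption; any chunking scheme works):

```python
# Split the dataframe into num_partitions roughly equal pieces.
df_split = np.array_split(df, num_partitions)
```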
Next, instantiate a Pool instance with the number of cores on your machine. The pool.map function then applies func to each of the partitioned dataframes by iterating through the given list, and pd.concat just re-concatenates all of the partitioned dataframes into one again.
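Putting it all together, a minimal sketch of the helper might look like this (parallelize_dataframe is just what I’d call it):

```python
def parallelize_dataframe(df, func):
    # Split the dataframe, map func over the pieces in parallel
    # worker processes, then concatenate the results back together.
    df_split = np.array_split(df, num_partitions)
    with Pool(num_cores) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df
```

And a quick usage example, with a deliberately expensive row-wise apply standing in for real work (the function and column math are made up for illustration):

```python
def add_columns(df):
    # A slow row-wise apply; replace with your own expensive function.
    df['c'] = df.apply(lambda row: row['a'] + row['b'], axis=1)
    return df

if __name__ == '__main__':  # guard is required where workers are spawned
    df = parallelize_dataframe(df, add_columns)
```

One caveat: the function you pass in has to be defined at the top level of a module so multiprocessing can pickle it, and on Windows and macOS the call should sit behind an if __name__ == '__main__' guard.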
If you use AWS and max out the number of cores, this can vastly improve the speed of expensive functions on pandas dataframes. I’m not too sure how memory-efficient concatenating the dataframes is, though. More on the map function is here.
Let me know if it works for you!