I’m always on the lookout for quick hacks and code snippets that might help improve efficiency. Most of the time I find them on Stack Overflow, but here’s one that deals with parallelization and efficiency that I thought would be worth sharing.

Since Pandas doesn’t have an internal parallelism feature yet, applying a function to a huge dataset is a pain if that function has an expensive computation time. One way to shorten that time is to split the dataset into separate pieces, perform the apply function on each piece in parallel, and then re-concatenate the pieces back into one pandas dataframe.

Let’s take an example pandas dataframe.

import pandas as pd
import numpy as np
import seaborn as sns
from multiprocessing import Pool

num_partitions = 10 #number of partitions to split dataframe
num_cores = 4 #number of cores on your machine

iris = pd.DataFrame(sns.load_dataset('iris'))

I’m going to use the multiprocessing package in Python and import Pool. Pool helps spin up new worker processes (not threads) on the machine, so each worker runs in its own interpreter and sidesteps the GIL.

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)  # split the dataframe into chunks
    pool = Pool(num_cores)                         # one worker process per core
    df = pd.concat(pool.map(func, df_split))       # apply func to each chunk, then stitch the results back together
    pool.close()
    pool.join()
    return df

df and func are the dataframe and the function being applied to the dataframe, respectively. np.array_split breaks the dataframe into the set of partitions. Note that, depending on your pandas and NumPy versions, asking for more partitions than there are rows can either throw an error or hand back empty pieces, so keep num_partitions at or below the row count.

Instantiate a Pool instance with the number of cores on your machine. Then pool.map applies func to each partitioned dataframe by iterating through the given list, and pd.concat just re-concatenates all of the partitioned dataframes into one again.
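
If you’d rather not worry about the partition count at all, a small tweak clamps it to the row count. This is just a sketch of that idea (the parallelize_dataframe_safe name and the clamping are my additions, not part of the original function):

def parallelize_dataframe_safe(df, func, num_partitions=10, num_cores=4):
    # never ask for more partitions than there are rows,
    # so every chunk handed to a worker is non-empty
    num_partitions = min(num_partitions, len(df))
    df_split = np.array_split(df, num_partitions)
    with Pool(num_cores) as pool:  # context manager closes the pool for us
        df = pd.concat(pool.map(func, df_split))
    return df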

Example:

def multiply_columns(data):
    # add a column holding the length of each species name
    data['length_of_word'] = data['species'].apply(lambda x: len(x))
    return data
    
iris = parallelize_dataframe(iris, multiply_columns)
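
To see whether the parallel version actually pays off on your machine, you can time it against a plain apply. A minimal sketch (the timing code and the __main__ guard are my additions; the guard matters on Windows and macOS, where multiprocessing spawns fresh interpreters that re-import your script):

import time

if __name__ == '__main__':
    start = time.time()
    iris = parallelize_dataframe(iris, multiply_columns)
    print(f"parallel: {time.time() - start:.3f}s")

    start = time.time()
    iris = multiply_columns(iris)
    print(f"serial:   {time.time() - start:.3f}s")

On a dataset as small as iris, the process startup and pickling overhead will likely swamp any gains; the payoff shows up when func is genuinely expensive.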

If you use AWS and maximize the number of your cores, it can vastly improve the speed of expensive functions on pandas dataframes. I’m not too sure how memory-efficient the concatenation step is, though, since each worker holds its own copy of a chunk. More on the map function is in the Python multiprocessing documentation.
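
Rather than hard-coding num_cores, you can also let Python detect it. A small sketch using multiprocessing.cpu_count(), which is in the standard library:

from multiprocessing import Pool, cpu_count

num_cores = cpu_count()       # use every core the machine reports
num_partitions = num_cores    # one chunk per core keeps workers evenly loaded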

Let me know if it works for you!