Stop using Pandas Apply and start using Numpy’s Vectorize
np.vectorize — a much faster alternative to the Pandas apply() method in Python, and a much, much faster alternative to for-loops.
Here’s the scenario: you have a function that you need to apply to two columns in a Pandas Dataframe. The function takes the values of two columns, and computes a list of new values as an output.
The package we’ll use for this example is compare-strings, and we’ll benchmark three different methods of using this function across columns of a dataframe. The test dataset is a list of just over 10,000 email addresses and names, and the function compare_strings will return a value based on the Levenshtein distance between the two strings.
A quick note on compare_strings: the way we’ll use this function for this test includes an argument, email=1, which simply tells the function that the first value is an email address and has it complete some pre-processing before computing the output. By default, the output is the Levenshtein distance between the two strings, as a proportion of the length of the first string.
For Loop
The very first way you may learn to do this is with the familiar For-Loop. We’ll use iterrows() to iterate through the Dataframe, computing the required value for each row, and appending that value to a list.
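Here is a minimal sketch of that loop. The compare_strings defined below is a hypothetical stand-in written for illustration, not the package's actual code; the column names (email, name), the sample rows, and the "keep the part before the @" pre-processing are assumptions as well.

```python
import pandas as pd


def compare_strings(first, second, email=0):
    """Hypothetical stand-in for the package function (not its real code)."""
    if email:
        first = first.split("@")[0]  # assumed pre-processing: keep the local part
    first, second = first.lower(), second.lower()
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(second) + 1))
    for i, ca in enumerate(first, 1):
        curr = [i]
        for j, cb in enumerate(second, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    # Distance as a proportion of the length of the first string.
    return prev[-1] / max(len(first), 1)


# Tiny made-up dataset standing in for the 10K-row test data.
df = pd.DataFrame({
    "email": ["jane.doe@example.com", "bob@example.com"],
    "name": ["jane.doe", "alice"],
})

# Walk the rows one by one and collect each result in a list.
scores = []
for _, row in df.iterrows():
    scores.append(compare_strings(row["email"], row["name"], email=1))

df["score"] = scores
```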
Computing the compare_strings value for the 10K records this way takes approximately 0.34 seconds. Not too bad, but when we start to look at real-world datasets with many more records (think 1 million or more), we might need something faster.
Apply
A smarter way to compute the desired output based on two columns of a Dataframe is to use the Pandas Apply function combined with a Lambda function. Here’s what that looks like:
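A minimal sketch of the Apply version, under the same assumptions as before: compare_strings is a hypothetical stand-in for the package function, and the column names and sample rows are made up for illustration.

```python
import pandas as pd


def compare_strings(first, second, email=0):
    """Hypothetical stand-in for the package function (not its real code)."""
    if email:
        first = first.split("@")[0]  # assumed pre-processing: keep the local part
    first, second = first.lower(), second.lower()
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(second) + 1))
    for i, ca in enumerate(first, 1):
        curr = [i]
        for j, cb in enumerate(second, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1] / max(len(first), 1)


df = pd.DataFrame({
    "email": ["jane.doe@example.com", "bob@example.com"],
    "name": ["jane.doe", "alice"],
})

# axis=1 hands each row to the lambda, which picks out the two columns.
df["score"] = df.apply(
    lambda row: compare_strings(row["email"], row["name"], email=1),
    axis=1,
)
```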
Using this method over 10k records took 0.13 seconds, about 2.6x faster than the For-Loop method. Fantastic! But I think we can do better.
np.vectorize
Using Numpy’s Vectorize, the processing time for these 10k records is 0.066 seconds, about 2x faster than Apply and 5.2x faster than the For-Loop method.
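The call might look something like the sketch below. As in the earlier examples, compare_strings is a hypothetical stand-in for the package function and the data is made up; passing otypes is an optional choice made here so the output type is stated explicitly.

```python
import numpy as np
import pandas as pd


def compare_strings(first, second, email=0):
    """Hypothetical stand-in for the package function (not its real code)."""
    if email:
        first = first.split("@")[0]  # assumed pre-processing: keep the local part
    first, second = first.lower(), second.lower()
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(second) + 1))
    for i, ca in enumerate(first, 1):
        curr = [i]
        for j, cb in enumerate(second, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1] / max(len(first), 1)


df = pd.DataFrame({
    "email": ["jane.doe@example.com", "bob@example.com"],
    "name": ["jane.doe", "alice"],
})

# Wrap the scalar function so it accepts whole columns at once.
vectorized_compare = np.vectorize(compare_strings, otypes=[float])

# The two columns are paired element by element; the scalar email=1
# is broadcast to every pair. The result is a NumPy array of floats.
result = vectorized_compare(df["email"], df["name"], email=1)
df["score"] = result
```

Note that the columns are passed whole, with no lambda and no axis argument: np.vectorize does the element-wise pairing itself.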
100 runs of each method (about 1 million records):
Looking at a comparison of run times for the 3 methods running on ~1 million records, we see that Numpy’s Vectorize function is the clear winner, coming in at approximately 6 seconds, versus 13 seconds for Apply and 32 seconds when using a For-Loop. In fact, for simpler, purely mathematical operations, Numpy’s Vectorize function outshines Apply even more, sometimes by a factor of 50.
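If you want to try a comparison like this yourself, a rough timing harness could look like the sketch below. It is not the benchmark behind the numbers above: compare_strings is a hypothetical stand-in for the package function, the 1,000-row dataset is made up, and the timings you see will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd


def compare_strings(first, second, email=0):
    """Hypothetical stand-in for the package function (not its real code)."""
    if email:
        first = first.split("@")[0]  # assumed pre-processing: keep the local part
    first, second = first.lower(), second.lower()
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(second) + 1))
    for i, ca in enumerate(first, 1):
        curr = [i]
        for j, cb in enumerate(second, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1] / max(len(first), 1)


# 1,000 made-up rows; scale this up to get closer to the article's setup.
df = pd.DataFrame({
    "email": ["jane.doe@example.com", "bob@example.com"] * 500,
    "name": ["jane.doe", "alice"] * 500,
})


def with_loop():
    return [compare_strings(row["email"], row["name"], email=1)
            for _, row in df.iterrows()]


def with_apply():
    return df.apply(
        lambda row: compare_strings(row["email"], row["name"], email=1),
        axis=1,
    ).tolist()


def with_vectorize():
    return np.vectorize(compare_strings)(df["email"], df["name"], email=1).tolist()


# Time three runs of each method and print the totals.
for method in (with_loop, with_apply, with_vectorize):
    seconds = timeit.timeit(method, number=3)
    print(f"{method.__name__}: {seconds:.3f} s for 3 runs")
```

All three functions return the same list of scores, so the only thing being compared is how the rows are fed to the function.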
Interestingly, Numpy’s own documentation states:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
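But the results speak for themselves. The likely reason is not true vectorization but overhead: iterrows() and apply() with axis=1 both build a Pandas Series for every row, while np.vectorize passes plain values straight to the function.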
Hopefully you’ll start using np.vectorize to optimise your use of functions in the future! Thank you for reading.