Stop using Pandas Apply and start using Numpy’s Vectorize

Michael Taverner
3 min read · Sep 18, 2022

np.vectorize — a much faster alternative to the Pandas apply() method in Python, and a much, much faster alternative to for-loops.

Here’s the scenario: you have a function that you need to apply to two columns in a Pandas Dataframe. The function takes the values from the two columns in each row and returns a new value, so you end up with one output value per row.

The package we’ll use for this example is compare-strings, and we’ll benchmark three different methods of using this function across columns of a dataframe. The test dataset is a list of just over 10,000 email addresses and names, and the function compare_strings will return a value based on the Levenshtein Distance between the two strings.

A quick note on compare_strings: the way we’ll use this function for this test includes an argument email=1, which simply tells the function that the first value is an email address and has it complete some pre-processing before computing the output. By default, the output is the Levenshtein distance between the two strings, as a proportion of the length of the first string.
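To make that output concrete, here is a minimal sketch of "Levenshtein distance as a proportion of the first string's length". This is not the compare-strings implementation; the names levenshtein and distance_ratio are hypothetical, and the email pre-processing step is left out entirely.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def distance_ratio(first: str, second: str) -> float:
    """Edit distance divided by the length of the first string.

    Hypothetical helper: the max(..., 1) guard only avoids dividing
    by zero when the first string is empty.
    """
    return levenshtein(first, second) / max(len(first), 1)


print(distance_ratio("kitten", "sitting"))  # 3 edits / 6 characters = 0.5
```

A lower ratio means the two strings are closer; identical strings give 0.0.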

For Loop

The very first way you may learn to do this is with the familiar For-Loop. We’ll use iterrows() to iterate through the Dataframe, computing the required value for each row, and appending that value to a list.
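Here is a rough sketch of the loop approach. The column names email and name are assumptions about the test data, and compare is a stand-in scorer built on Python's difflib rather than the real compare_strings call.

```python
from difflib import SequenceMatcher

import pandas as pd


def compare(email: str, name: str) -> float:
    # Stand-in scorer (NOT the compare-strings package): similarity in [0, 1].
    return SequenceMatcher(None, email, name).ratio()


# Tiny made-up dataset; the article's test set has just over 10,000 rows.
df = pd.DataFrame({
    "email": ["jane.doe@example.com", "bob@example.com"],
    "name": ["Jane Doe", "Robert Smith"],
})

scores = []
for _, row in df.iterrows():  # each row arrives as a pandas Series
    scores.append(compare(row["email"], row["name"]))

df["score"] = scores
print(df)
```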

Computing the compare_strings value for the 10K records this way takes approximately 0.34 seconds. Not too bad, but when we start to look at real-world datasets with many more records (think 1 million+), we might need something faster.

Apply

A smarter way to compute the desired output based on two columns of a Dataframe is to use the Pandas Apply function combined with a Lambda function. Here’s what that looks like:
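A minimal sketch of the Apply version, under the same assumptions as before: the column names email and name are guesses at the test data, and compare is a difflib-based stand-in for compare_strings.

```python
from difflib import SequenceMatcher

import pandas as pd


def compare(email: str, name: str) -> float:
    # Stand-in scorer (NOT the compare-strings package): similarity in [0, 1].
    return SequenceMatcher(None, email, name).ratio()


df = pd.DataFrame({
    "email": ["jane.doe@example.com", "bob@example.com"],
    "name": ["Jane Doe", "Robert Smith"],
})

# axis=1 hands each row to the lambda as a Series, one row at a time.
df["score"] = df.apply(lambda row: compare(row["email"], row["name"]), axis=1)
print(df)
```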

Using this method over 10k records took 0.13 seconds, about 2.6x faster than the For-Loop method. Fantastic! But I think we can do better.

np.vectorize

Using Numpy’s Vectorize, the processing time for these 10k records is 0.066 seconds, ~2x faster than Apply and 5.2x faster than the For-Loop method.
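The same computation with np.vectorize might look something like this. As before, the column names and the difflib-based compare function are stand-ins, not the article's actual data or the compare-strings API.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def compare(email: str, name: str) -> float:
    # Stand-in scorer (NOT the compare-strings package): similarity in [0, 1].
    return SequenceMatcher(None, email, name).ratio()


df = pd.DataFrame({
    "email": ["jane.doe@example.com", "bob@example.com"],
    "name": ["Jane Doe", "Robert Smith"],
})

# Wrap the scalar function so it accepts whole arrays.
# otypes tells NumPy the output dtype up front.
vectorized_compare = np.vectorize(compare, otypes=[float])

# Pass the two columns as arrays; the result is a NumPy array of scores.
df["score"] = vectorized_compare(df["email"].to_numpy(), df["name"].to_numpy())
print(df)
```

Note that the wrapped function receives plain values from each column rather than a row object, so there is no row["email"] lookup inside it.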

Figure: 100 runs of each method (~1 million records)

Looking at a comparison of run times for the 3 methods running on ~1 million records, we see that Numpy’s Vectorize function is the clear winner, coming in at approximately 6 seconds, versus 13 seconds for Apply and 32 seconds when using a For Loop. In fact, for simpler, pure math based operations, Numpy’s Vectorize function outshines Apply even more, sometimes by a factor of 50.
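If you want to try a comparison like this on your own machine, a small timing harness along these lines should do. It is only a sketch: the data is synthetic, compare is a difflib-based stand-in for compare_strings, and the helper names with_loop, with_apply and with_vectorize are made up for this example. Your timings will differ from the figures quoted above.

```python
import time
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def compare(email: str, name: str) -> float:
    # Stand-in scorer (NOT the compare-strings package).
    return SequenceMatcher(None, email, name).ratio()


def with_loop(df):
    return [compare(row["email"], row["name"]) for _, row in df.iterrows()]


def with_apply(df):
    return df.apply(lambda row: compare(row["email"], row["name"]), axis=1).tolist()


def with_vectorize(df):
    func = np.vectorize(compare, otypes=[float])
    return func(df["email"].to_numpy(), df["name"].to_numpy()).tolist()


# Synthetic data; increase the row count to get closer to the article's 10K.
n_rows = 1000
df = pd.DataFrame({
    "email": [f"user{i}@example.com" for i in range(n_rows)],
    "name": [f"User {i}" for i in range(n_rows)],
})

results, timings = {}, {}
for label, method in [("for loop", with_loop),
                      ("apply", with_apply),
                      ("np.vectorize", with_vectorize)]:
    start = time.perf_counter()
    results[label] = method(df)
    timings[label] = time.perf_counter() - start
    print(f"{label:>12}: {timings[label]:.4f} s")
```

All three methods should return identical scores, which is worth checking before trusting any timing numbers.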

Interestingly, Numpy’s own documentation states:

“The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.”

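But the results speak for themselves. np.vectorize still calls the Python function once per element, but it works on plain arrays and skips the overhead of building a Series for every row, which both iterrows() and apply(axis=1) incur.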

Hopefully you’ll start using np.vectorize to optimise your use of functions in the future! Thank you for reading.

Until next time!

Michael Taverner

I'm an Australian Fraud Prevention Data Analyst, live sound engineer and data science enthusiast.