Modin – How to speed up pandas by changing one line of code

Modin is a Python library for handling large datasets using parallelisation. Its syntax is similar to pandas, and its performance has made it a promising solution: you can speed up existing code by changing just one line. This article shows you why you should start using Modin and how to use it, with hands-on examples.

Contents

  1. The need for Modin
  2. What is Modin and why it matters?
  3. Getting started with Modin
  4. Comparing Modin Vs Pandas
  5. How Modin compares with other alternatives?

The need for Modin

In Python, pandas is the most popular library for data analysis. Nearly every Pythonista in the field of data science uses it. The main reason behind its success is the neat and easy API that pandas offers, the result of tremendous effort by its author and team.

But every coin has two sides.

As long as the data we work with is small enough to fit in RAM, pandas is amazing. In reality, though, you often have to deal with much larger datasets, several gigabytes in size or more. In such cases, pandas may not cut it. pandas is designed to work on a single core: even though most of our machines have multiple CPU cores, pandas cannot make use of them.

We could benefit from a solution that accelerates pandas and speeds up computations on larger dataframes. One of the main requirements is that its API should be easy for pandas users to adopt, because the last thing you want is to learn a whole new syntax for handling dataframes.

This is where Modin comes in. Yes, you don’t need a new set of syntax to start using Modin. More on that shortly.

I hope you understood the problem with pandas. In the next section, you’ll see how Modin solves the problem.

What is Modin and why it matters?

Modin is a Python library that can be used in place of pandas, especially when processing huge datasets. Modin is capable of speeding up your pandas scripts by up to 4x.

Modin runs with Ray or Dask as its backend.

So, what does Modin do differently?

Modin enables you to use all the CPU cores available in your machine, unlike pandas. When you can run the same code with 4 processors instead of one (like in pandas), the time taken decreases significantly.

Put simply, Modin parallelizes your pandas operations.
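
To make the idea of parallelizing an operation concrete, here is a toy sketch of the general partition, apply, combine pattern, using only the standard library. It is not Modin's implementation: the helper names split_into_partitions and parallel_sum are made up for this illustration, and it uses threads for simplicity, whereas Modin hands the work to Ray or Dask.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n_partitions):
    """Split a list into at most n_partitions contiguous chunks of similar size."""
    chunk_size = max(1, -(-len(rows) // n_partitions))  # ceiling division
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def parallel_sum(rows, n_workers=4):
    """Sum a list by summing each partition on a worker, then combining."""
    partitions = split_into_partitions(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker computes the sum of one partition
        partial_sums = list(pool.map(sum, partitions))
    # Combine the per-partition results into the final answer
    return sum(partial_sums)


values = list(range(1, 1001))
total = parallel_sum(values, n_workers=4)
print(total == sum(values))  # the combined result matches the serial one
```

The point of the sketch is the shape of the computation: the data is cut into pieces, each piece is processed independently, and the partial results are merged. With Modin, this splitting and merging happens behind the familiar pandas API.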

What else?

  1. Modin is an extremely lightweight, robust DataFrame.
  2. It is highly compatible with existing pandas code, which makes it easy for users to adopt.
  3. To use Modin, you don’t need to know how many cores your system has. You also don’t have to specify how the data is distributed.
  4. Because its API is similar to pandas, Modin can provide the best of both worlds: speed and convenience.
  5. It aims to be the one tool for all dataframes, from 1 MB to 1 TB+.

Getting started with Modin

First, let’s install the Modin library using pip. As I said earlier, Modin uses either Dask or Ray as its backend. To install Modin together with Dask, use pip install modin[dask]. To run on Ray instead, install modin[ray].

# Install Modin dependencies and Dask to run on Dask
!pip install modin[dask]

Next comes the important part. Modin claims that you need to change just one line to speed up your code: replace import pandas as pd with import modin.pandas as pd, and you get the benefit of the additional speed.

import modin.pandas as pd
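
If the same script has to run on machines where Modin may not be installed, one possible pattern is to check for it first and fall back to plain pandas. The sketch below only decides which module name to use; the helper pick_dataframe_module is hypothetical and not part of Modin or pandas.

```python
import importlib
import importlib.util


def pick_dataframe_module():
    """Return 'modin.pandas' if Modin is installed, otherwise 'pandas'."""
    if importlib.util.find_spec("modin") is not None:
        return "modin.pandas"
    return "pandas"


module_name = pick_dataframe_module()
print(module_name)
# To actually load the chosen library:
# pd = importlib.import_module(module_name)
```

The rest of the script can then keep using pd as usual, whichever library was picked.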

Modin also lets you choose which engine to use for computation, through the environment variable MODIN_ENGINE. The code below shows how to specify the engine:

import os

# Choose ONE of the two lines below (if both run, the last assignment wins),
# and set it before importing modin.pandas
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
os.environ["MODIN_ENGINE"] = "dask"  # Modin will use Dask
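
A typo in the engine name is easy to miss, so it can help to validate the value before setting the variable. This is a minimal sketch under the assumption that only the two engines named in this article are of interest; select_modin_engine and SUPPORTED_ENGINES are hypothetical names, not part of Modin.

```python
import os

# Engines named in this article; other values are not covered here.
SUPPORTED_ENGINES = ("ray", "dask")


def select_modin_engine(name):
    """Validate an engine name and store it in MODIN_ENGINE (hypothetical helper)."""
    engine = name.strip().lower()
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unknown engine {name!r}; expected one of {SUPPORTED_ENGINES}"
        )
    os.environ["MODIN_ENGINE"] = engine
    return engine


select_modin_engine("Dask")
print(os.environ["MODIN_ENGINE"])  # dask
```

As in the snippet above, the call should come before modin.pandas is imported.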

After this, almost everything works the same as in pandas.

Let’s start with the simple task of reading a CSV file and compare the speed of pandas and Modin. In the code below, we read a dataset using both libraries and record the time taken.

# Read in the data with Pandas
import pandas as pd
import time
s = time.time()
df = pd.read_csv("/content/my_dataset.csv")
e = time.time()
print("Pandas Loading Time = {}".format(e-s))

# Read in the data with Modin
import modin.pandas as pd

s = time.time()
df = pd.read_csv("/content/my_dataset.csv")
e = time.time()
print("Modin Loading Time = {}".format(e-s))

#> Pandas Loading Time = 0.1672
#> Modin Loading Time = 0.2508

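In this particular run, Modin was not faster: the load took about 0.25 s with Modin against 0.17 s with pandas. A load this quick suggests a small file, where the overhead of distributing the work can outweigh the benefit of parallelism. On larger files, such as the one in the next section, read_csv does get faster because all 4 cores of my machine are put to use.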
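
The start/stop timing pattern used above can be wrapped in a small reusable helper. The sketch below is one way to do it with the standard library; the name timed is made up for this example, and it uses time.perf_counter(), which is intended for measuring short intervals. The sum(...) call is only a stand-in for the real work, such as pd.read_csv(...).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Measure the wall time of the enclosed block and store it in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    sum(range(100_000))  # stand-in for pd.read_csv(...)

print("demo took {:.4f} s".format(timings["demo"]))
```

Collecting the measurements in a dictionary makes it easy to compare pandas and Modin runs side by side afterwards.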

Comparing Modin Vs Pandas

Quick recap: you can just import modin.pandas as pd and run almost all of your code just as you did in pandas.

In this section, I demonstrate a few examples using pandas and Modin.

You can see that the code is exactly the same (except the import statement), but there is a significant speed-up in execution time.

For the demonstration, I will be using the Credit Card Fraud Detection dataset (144 MB), which you can download from Kaggle. We’ll perform a series of operations on it.

1) Demonstrating read_csv()

First, let’s read the above dataset into a dataframe using all three options: pandas, Modin (Ray) and Modin (Dask). We record the time taken for each.

Using pandas:

# Load csv file using pandas
import pandas as pandas_pd
%time  pandas_df = pandas_pd.read_csv("/content/creditcard.csv")
#> CPU times: user 2.91 s, sys: 226 ms, total: 3.14 s
#> Wall time: 3.09 s

Using modin ray:

# Load csv file using Modin and Ray
import os
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
import ray
import modin.pandas as ray_pd

%time  mray_df = ray_pd.read_csv("/content/creditcard.csv")

#> CPU times: user 762 ms, sys: 473 ms, total: 1.24 s
#> Wall time: 2.67 s

Using modin dask:

# Load csv for Modin with Dask
import os
os.environ["MODIN_ENGINE"] = "dask"  # Modin will use Dask

from distributed import Client
client = Client(memory_limit='8GB')
import modin.pandas as dask_pd
%time  mdask_df = dask_pd.read_csv("/content/creditcard.csv")

#> CPU times: user 604 ms, sys: 288 ms, total: 892 ms
#> Wall time: 1.74 s

You can see the difference in the time taken by each method. Modin outdid pandas with both engines: 2.67 s with Ray and 1.74 s with Dask, against 3.09 s for pandas.

In the Dask example, I explicitly set the memory limit to avoid out-of-memory situations.

NOTE: The above numbers are results I got by running them on my machine. Your results might vary based on the hardware resources that are available to Modin.
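
Because a single measurement can be noisy, one option is to repeat the operation a few times and look at the median. The helper below is a rough sketch with a hypothetical name, median_wall_time; the sorted(...) call is only a stand-in workload that you would replace with the operation you want to measure. It also prints the number of logical cores, which is a useful figure to keep next to any timing.

```python
import os
import statistics
import time


def median_wall_time(func, repeats=5):
    """Call func() `repeats` times and return the median wall time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; replace with e.g. lambda: pd.read_csv(path)
elapsed = median_wall_time(lambda: sorted(range(50_000), reverse=True))
print("logical CPU cores:", os.cpu_count())
print("median wall time: {:.4f} s".format(elapsed))
```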

2) Demonstrating append()

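In the code above, we loaded the CSV dataset using both pandas and Modin. Next, we call append() with each of them, again recording the time taken. Note that DataFrame.append() was removed in pandas 2.0, so this example needs an older pandas version; concat(), shown in the next example, is the replacement.

# Using pandas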
%time df1 = pandas_df.append(pandas_df)

# Using Modin
%time df2 = mdask_df.append(mdask_df)
%time df3 = mray_df.append(mray_df)

CPU times: user 29.6 ms, sys: 74.4 ms, total: 104 ms
Wall time: 102 ms
CPU times: user 3.13 ms, sys: 0 ns, total: 3.13 ms
Wall time: 2.57 ms
CPU times: user 2.57 ms, sys: 839 µs, total: 3.41 ms
Wall time: 2.94 ms

Observe the outputs. With pandas, the task took 102 ms. With Modin, it dropped to under 3 ms!

Imagine the same effect on a task that takes minutes! That’s the scale of problem Modin can handle.

3) Demonstrating concat()

Now, we will use concat() with pandas and Modin.

This function concatenates two or more dataframes along either axis. I will again record the time taken in each case. Also notice that the statements themselves do not change, which shows how easy it is to adapt to Modin.

%time df1 = pandas_pd.concat([pandas_df for _ in range(5)])
%time df2 = dask_pd.concat([mdask_df for _ in range(5)])
%time df3 = ray_pd.concat([mray_df for _ in range(5)])

CPU times: user 90.1 ms, sys: 99.8 ms, total: 190 ms
Wall time: 181 ms
CPU times: user 4.75 ms, sys: 426 µs, total: 5.18 ms
Wall time: 4.49 ms
CPU times: user 4.89 ms, sys: 864 µs, total: 5.76 ms
Wall time: 4.52 ms

The time reduced from 181 ms to around 5 ms by the use of Modin. Wow!
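
To put the three experiments on the same scale, the wall times reported above can be turned into speed-up ratios (pandas time divided by Modin time). The short script below does only that arithmetic; the numbers are the ones from my runs, and the speedup helper is just for illustration.

```python
# Wall times reported above, in seconds
wall_times = {
    "read_csv": {"pandas": 3.09, "modin_ray": 2.67, "modin_dask": 1.74},
    "append": {"pandas": 0.102, "modin_ray": 0.00294, "modin_dask": 0.00257},
    "concat": {"pandas": 0.181, "modin_ray": 0.00452, "modin_dask": 0.00449},
}


def speedup(baseline, candidate):
    """Return how many times faster candidate is than baseline."""
    return baseline / candidate


for operation, times in wall_times.items():
    for engine in ("modin_ray", "modin_dask"):
        ratio = speedup(times["pandas"], times[engine])
        print(f"{operation:8s} {engine:10s} {ratio:5.1f}x")
```

The ratios make the contrast visible at a glance: the gain for read_csv is below 2x, while the append and concat ratios are far larger.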

Similarly, the majority of pd.DataFrame methods are implemented in Modin.

You can find the list of supported pandas API methods in the Modin documentation.

How Modin compares with other alternatives?

Modin is not the only option for speeding up Python and pandas. There are a few other important and popular libraries: Dask, Vaex, Ray, and cuDF are often considered alternatives to Modin and to each other. Let me take a quick look at how Modin differs from some of these.

Modin Vs Vaex

As you saw in the examples above, Modin aims to be a drop-in replacement for pandas, and its API replicates the pandas API. Vaex, by contrast, is not so similar to pandas.

So, when should you use which?

If you want to quickly speed up existing pandas code, go for Modin. But if you need to visualize large datasets, choose Vaex.

Modin Vs Dask

First, note that Dask plays two different roles. It can be used as a low-level scheduler to run Modin, as mentioned earlier. It also provides its own high-level dataframe, dask.dataframe, as an alternative to pandas.

Dask does solve the problem through parallel processing, but it doesn’t have full pandas compatibility. That is, you need to make changes to your code base. These are usually small, but it is definitely not like changing just one line of code, as you saw with Modin.

So, say you have complex pandas code. Simply switching from a pandas dataframe to dask.dataframe won’t give great results; you’ll have to make more changes. This is a disadvantage compared to Modin.

Modin vs. RAPIDS (cuDF)

RAPIDS is very effective at speeding up code, as it scales pandas code by running it on GPUs. The catch is that RAPIDS requires an NVIDIA GPU. If you have one, give RAPIDS a try: the speed gains can be enormous. Otherwise, it is easier and more straightforward to simply use Modin.

I hope you now understand the need for Modin and how to use it to speed up your pandas code. Stay tuned for more such articles.