How to deal with Big Data in Python for ML Projects (100+ GB)?

The size of the data you can load and train on with your local computer is typically limited by its memory. Let’s explore various options for dealing with big data in Python for your ML projects.

In certain situations, you might have a hard time working with large data, that is, data large enough that even simple data wrangling operations take a lot of time and computing resources.

Or worse, you are not even able to load the large dataset, which you might have as a ‘csv’ file or in some other format.

Sometimes, you might be able to load the data and do the wrangling, but training the model takes a lot of time, which slows down or limits your ability to experiment with various techniques.

Let’s look at various options you can try to manage big data in Python.

Main Approaches
1. Optimize dataframes size in Pandas
– Function to reduce the memory usage
2. Use only required columns
3. Chunking data
4. Sparse data formats
5. Efficient Data file formats
6. Pandas alternatives
– Modin
– Vaex
7. Dask – Efficient parallel computing for data analysis and machine learning
8. Distributed computing with Spark
9. Intel(R) extension for sklearn
10. General tools and tips you can exploit
– Apply Vectorized functions
– Numba
– pd.eval and df.query
– Rapids cuDF

1. Optimize dataframes size in Pandas

Typically, when pandas creates a dataframe it assigns a larger datatype than is really necessary to hold the data. For example, you don’t need an int64 datatype to store an ‘age’ variable.

Why?

Because ‘Age’ typically goes from 0 to 100, whereas int64 can hold much larger numbers. An int8 datatype, which needs 1 byte per value instead of 8, should be sufficient to represent ‘Age’.

This is most useful if you are able to load the large data into a Jupyter notebook / Python session, but the data wrangling operations you do seem to take a lot of time.

Chances are you are facing memory constraints, and the dataframe you are using is consuming a major chunk of the available memory.

It is possible to reduce the size of the dataframe by changing the datatype of the columns.

This applies to a lot of data that occurs in the real world. By checking the min and max values of each column, you can pick the smallest datatype that can hold them and save memory, leaving more room for your other data wrangling activities.
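As a minimal sketch of that idea, the snippet below checks the range of a single column and downcasts it with pd.to_numeric. The dataframe df_demo and its values are made up for illustration; they are not from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 ages stored in the default 64-bit integer type
df_demo = pd.DataFrame({"age": np.arange(0, 100, dtype=np.int64)})

# Check whether the observed range fits into int8 (-128 to 127)
age_min, age_max = df_demo["age"].min(), df_demo["age"].max()
fits_int8 = np.iinfo(np.int8).min <= age_min and age_max <= np.iinfo(np.int8).max

bytes_before = df_demo["age"].memory_usage(index=False)

# downcast="integer" picks the smallest signed integer dtype that holds the values
df_demo["age"] = pd.to_numeric(df_demo["age"], downcast="integer")

bytes_after = df_demo["age"].memory_usage(index=False)
print(fits_int8, df_demo["age"].dtype, bytes_before, bytes_after)
```

Doing this by hand works for one or two columns; for a wide dataframe it is easier to loop over all the columns.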

Now, how do you optimize a dataframe in Python?

Simply use the reduce_mem_usage function defined below.

# Show max and min values a datatype can hold
import numpy as np, pandas as pd
print(np.iinfo('int8'))
print(np.iinfo('int16'))
print(np.iinfo('int32'))
print(np.iinfo('int64'))

Output

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------

Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

Function to reduce the memory usage.

# Reduce memory usage by downcasting each numeric column to the smallest dtype that fits
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Integer column: pick the narrowest integer type covering [c_min, c_max]
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # otherwise the column stays int64
            else:
                # Float column: narrower floats also keep fewer significant digits
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
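One caveat before applying such a function to float columns: narrower float types keep fewer significant digits, so a value can change when it is downcast. The small check below is an illustration added here, not part of the function above; the sample values are arbitrary.

```python
import numpy as np

# float16 has an 11-bit significand, so integers above 2048 are no longer all exact
value = 2049.0
as_f16 = np.float16(value)
print(float(as_f16))  # 2048.0

# float32 has a 24-bit significand: 16777217 (2**24 + 1) gets rounded
big = 16777217.0
as_f32 = np.float32(big)
print(float(as_f32) == big)  # False

# A simple safety check: only downcast when the round trip is lossless
col = np.array([0.5, 1.25, 1000.0])
lossless = np.array_equal(col.astype(np.float16).astype(np.float64), col)
print(lossless)  # True, these values are exactly representable in float16
```

If exact values matter (identifiers, currency amounts), it may be safer to leave float columns at float32 or float64.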

Create dataframe

df = pd.DataFrame({'age':[1,2,3,4,5,6]})
print(df.dtypes)
df.memory_usage() # 64 bits per item

Result

age int64
dtype: object

Index 128
age 48
dtype: int64

Compress dataframe

Now, let’s compress it and check the size.

df_compressed = reduce_mem_usage(df)
print(df_compressed.dtypes)
df.memory_usage() # 8 bits per item

Result

Mem. usage decreased to 0.00 Mb (23.9% reduction)
age int8
dtype: object

Index 128
age 6
dtype: int64

That’s a drop from 48 bytes to 6 bytes. When extrapolated for large real world datasets, that’s significant.
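To put that in perspective, here is the same arithmetic for a single column with 100 million rows. The row count is an assumption chosen for illustration.

```python
import numpy as np

n_rows = 100_000_000  # hypothetical row count

# Memory needed for one column: rows x bytes per value, converted to MB
int64_mb = n_rows * np.dtype("int64").itemsize / 1024**2
int8_mb = n_rows * np.dtype("int8").itemsize / 1024**2

print(f"int64: {int64_mb:.0f} MB, int8: {int8_mb:.0f} MB")  # int64: 763 MB, int8: 95 MB
```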

2. Use only required columns

When you are importing the dataframe, use the usecols argument in pd.read_csv to load only the columns you are going to use.

This is more applicable when you use a wide dataset that contains numerous columns, but you need only a handful of them.

But how do you know the column names in the first place?

Load only a sample of the data, say the first 5 rows, by setting nrows=5. Get the column names and decide which ones you want to keep.

# Import columns in positions 1,3,9 instead of all columns.
location = 'data.csv'
df = pd.read_csv(location, usecols=[1,3,9])
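The two steps (peek at the header, then load selected columns) can be sketched as follows. The CSV content and the column names here are hypothetical and held in memory so the example runs on its own; with a real file you would pass its path instead of io.StringIO.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so the example is self-contained
csv_text = "id,name,age,city\n1,Ann,34,Pune\n2,Bob,41,Oslo\n3,Cy,29,Lima\n"

# Step 1: read only a couple of rows to discover the column names cheaply
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(list(preview.columns))

# Step 2: load only the columns you decided to keep, selected by name
wanted = ["id", "age"]
df_small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(df_small.shape)
```

Selecting by name rather than by position tends to be more robust if the column order of the file changes.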

3. Chunking data

If you must import all the columns and process all of the rows, but the full dataset does not fit in memory while the processed output (for example, an aggregate or a filtered subset) does, then chunking might be helpful.

That is, read your data in chunks of a defined number of rows, do the calculations on each chunk and retain only the processed output.

import pandas as pd

# Replace this with the path to your own file, e.g. 'data.csv'
location = 'https://raw.githubusercontent.com/selva86/datasets/2bcde4513ea9a7d9c2542bbeafe488f707415980/Glass.csv'

# With chunksize set, read_csv returns an iterator of dataframes instead of a single dataframe
df_chunks = pd.read_csv(location, chunksize=20)

# You can do any processing here if needed and retain only the processed data.
# In this example, I am only checking the shape of each chunk.
# Practically, you might want to do some aggregation and store only the result for each chunk.
for df_chunk in df_chunks:
    print(df_chunk.shape)

Output

(20, 10)
(20, 10)
(20, 10)
(20, 10)
(20, 10)
(20, 10)
(20, 10)
(20, 10)
(20, 10)
(20, 10)
(14, 10)
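A more practical pattern is to aggregate each chunk and then combine the partial results. The sketch below assumes a tiny made-up CSV with hypothetical columns group and value; only the per-chunk sums are kept in memory.

```python
import io
import pandas as pd

# Hypothetical CSV with 10 rows, built in memory so the example is self-contained
lines = ["group,value"] + [f"{'a' if i % 2 == 0 else 'b'},{i}" for i in range(10)]
csv_text = "\n".join(lines)

# Aggregate each chunk separately and keep only the small partial result
partial_sums = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partial_sums.append(chunk.groupby("group")["value"].sum())

# Combine the partial results: sum the per-chunk sums for each group
totals = pd.concat(partial_sums).groupby(level=0).sum()
print(totals.to_dict())  # {'a': 20, 'b': 25}
```

This works directly for aggregations that can be combined from partial results, such as sums and counts. A mean, for instance, needs the sum and the count tracked separately.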

4. Sparse data formats

If your data is sparse, that is, the majority of the values in the data are zeros, then instead of storing it in a dataframe you can store it in a sparse format.

This type of format is provided in the scipy library. The difference between a dataframe and a sparse matrix is that, while a dataframe stores a value for every cell, the sparse matrix stores only the non-zero values together with their positions.

Since the non-zero values are only a small fraction of the data, you save memory by not having to store the zeros.

You can create a sparse matrix using csr_matrix from scipy.sparse.

# Create sparse matrix using csr_matrix()
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 1])
col = np.array([0, 1, 0, 2, 2])

# data (the non-zero values)
data = np.array([1, 2, 3, 4, 5])

# creating sparse matrix
sparseMatrix = csr_matrix((data, (row, col)),
                          shape=(3, 3))
# print the sparse matrix
print(sparseMatrix)

Output

(0, 0) 1
(0, 1) 2
(1, 0) 3
(1, 2) 5
(2, 2) 4

Convert a sparse matrix to a dense array

s_array = sparseMatrix.toarray()
s_array

Output

array([[1, 2, 0],
       [3, 0, 5],
       [0, 0, 4]], dtype=int32)

Convert dense array back to a sparse matrix

sm = csr_matrix(s_array)
sm

Output

<3x3 sparse matrix of type '<class 'numpy.intc'>'
    with 5 stored elements in Compressed Sparse Row format>

print(sm)

Output

(0, 0) 1
(0, 1) 2
(1, 0) 3
(1, 2) 5
(2, 2) 4
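To see the memory saving on something larger than a 3x3 matrix, the sketch below builds a hypothetical 1000 x 1000 array with only 100 non-zero entries and compares the bytes used by the dense and CSR representations. The sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical mostly-zero matrix: 1,000,000 cells, only 100 of them non-zero
dense = np.zeros((1000, 1000), dtype=np.float64)
dense[::100, ::100] = 1.0

sparse = csr_matrix(dense)

# Dense storage holds every cell; CSR holds the values plus two index arrays
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(sparse.nnz, dense_bytes, sparse_bytes)
```

The dense array takes 8,000,000 bytes, while the CSR matrix needs only a few kilobytes. The exact figure depends on the index dtype scipy picks.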

5. Efficient Data file formats

Typically, when storing data, people tend to store it as a ‘csv’ file, a tab-delimited file, etc.

While these are conveniently viewed using spreadsheet software, they are not so great for storing large data.

Why?

Because (i) they occupy a lot of disk space, since every value is stored as text, and (ii) loading and exporting datasets stored in such formats takes a long time.
This can be a problem when working with large volumes of data.

There are a few convenient file formats that overcome these problems. The most important ones are:

  1. Parquet
  2. Feather
  3. HDF5

Let’s see an example.

# create a large dataset
df_random = pd.DataFrame(np.random.randint(0,100,size=(10000000,10)))
df_random.to_csv("data_random.csv")

Load the data with pd.read_csv()

%%time
df = pd.read_csv('data_random.csv')

Output

CPU times: total: 6.48 s
Wall time: 6.54 s

Let’s now export it to a parquet format and load it back. But to save it as parquet, you will need the ‘pyarrow’ library. So, let’s install it.

# !pip install pyarrow

Export the data as a csv file, a parquet file and a feather file and see how long each one takes.

%%time
df.to_csv("data_random.csv")

Output

CPU times: total: 35.4 s
Wall time: 36.4 s

Export to parquet

%%time
df.to_parquet("data_random.parquet")

Output

CPU times: total: 2.72 s
Wall time: 2.83 s

Export to Feather format

%%time
df.to_feather("data_random.feather")

Output

CPU times: total: 1.61 s
Wall time: 723 ms

In this run, exporting to parquet was roughly 13x faster than exporting to csv (2.83 s vs 36.4 s), and feather was the fastest at about 50x (723 ms).

Let’s try reading them back in.

%%time
df_parq = pd.read_parquet("data_random.parquet")

Output

CPU times: total: 1.81 s
Wall time: 842 ms

Read back the feather file

%%time
df_feather = pd.read_feather("data_random.feather")

Output

CPU times: total: 1.7 s
Wall time: 716 ms

Reading the parquet and feather files is also much faster than reading the csv: under a second each in this run, compared to about 6.5 seconds. So, when working with large data, parquet and feather are clearly preferred. Parquet has relatively wider adoption, so you might want to consider that.

Another alternative you might want to explore is storing the data in HDF5 format, where data loads can also be very fast.

6. Pandas alternatives

While pandas is sort of the default library people use to wrangle data with Python, there are other formidable options too.

I’ve been quite impressed with:

  1. Modin – Allows you to work on large data. Modin runs on top of high performance engines like ‘dask’ and ‘ray’ while keeping the pandas syntax. You can reuse your existing pandas code by changing just one line, the import (import modin.pandas as pd).

  2. Vaex – Uses memory mapping to let you work with large datasets that do not fit in RAM.

  3. Dask – Effectively implements parallel processing for data analysis tasks. Highly recommended.

7. Dask

Dask deserves a special mention. There are various multithreading and parallelization packages in Python, but when it comes to data wrangling and analysis, Dask is great.

Dask dataframes use lazy evaluation, parallel computing and computational graphs to allow you to work with large datasets.

This dask tutorial should give you a quick overview of dask functionalities and help you understand the core ideas well.

Dask also works well with other popular machine learning libraries such as sklearn, XGBoost, LightGBM, PyTorch and TensorFlow.

You can build machine learning models as well using Dask and have it parallelized.

Here is a quick example of how you can build an SVM model and search for the best hyperparameters using RandomizedSearchCV.

import numpy as np
from dask.distributed import Client

import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

client = Client(processes=False) # create local cluster

digits = load_digits()

param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)

# Run the search with Dask as the joblib backend
with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)

8. Distributed computing with Spark

When the size of the data keeps growing, companies turn to distributed computing with Spark, where the data and the computation are spread across the nodes of a cluster.

Apache Spark is an open source framework that can scale as your data grows. Two major commercial providers of Apache Spark platforms are Databricks and Cloudera.

What is great about Spark?

Today, it is an indispensable tool for Data Engineers. Spark nicely interplays with Python and SQL, where you can seamlessly mix SQL queries with Python / Spark code.

results = spark.sql("SELECT * FROM tablename")

You can import data from a wide variety of data sources, run Hive queries and, with packages like Koalas (now part of PySpark itself as the pandas API on Spark, pyspark.pandas), write pandas-like code and yet use PySpark in the backend.

For Data Scientists, a significant benefit is the availability of distributed versions of various ML algorithms in Spark.

9. Intel(R) extension for sklearn

A possible way to deal with big data is to use a smaller sample of your dataset to build the machine learning models.

While doing that, you might want to build your models on multiple samples instead of just one, to generalize the predictions. In this process, the key is to be able to train your ML models quickly.

It is possible to speed up the training of your ML models built with scikit-learn using the Intel(R) extension for the scikit-learn library.

Speedups of up to 3000x have been reported for some algorithms, though the gain depends on the algorithm, the data and the hardware.

The great part is that, to make use of this speedup, you only need to change an import or add a two-line patch. The rest of your existing scikit-learn code stays the same.

Install the extension

pip install scikit-learn-intelex

Import specific implementation of ML algo

# from sklearn.svm import SVC
from sklearnex.svm import SVC

# normal code without any more changes

Alternatively, you can patch everything so that all of your existing scikit-learn models use the Intel(R) extension instead. Call patch_sklearn() before importing from sklearn.

from sklearnex import patch_sklearn
patch_sklearn()
from sklearn.svm import SVC
# Your subsequent code without any changes

General tools and tips you can exploit

1. Apply Vectorized functions

Whenever possible, use the built-in functions in pandas or numpy to do the operations.

If you have to use the DataFrame.apply() function, see if that operation can be done with a built-in function instead. This is because apply() and its cousins like iterrows() loop over the dataframe row by row in Python.

Whereas the built-in functions (like np.sum etc.) are typically optimized for speed and usually run much faster.
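As a small illustration, the same row-wise calculation is written below once with apply() and once as a vectorized expression. The orders dataframe and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: price and quantity per order
orders = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row by row: apply() calls the Python lambda once per row
total_apply = orders.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: one multiplication over whole columns, executed in compiled code
total_vectorized = orders["price"] * orders["qty"]

print(total_vectorized.tolist())  # [10.0, 40.0, 90.0]
```

Both give the same result. On three rows the difference is invisible, but the gap grows with the number of rows, so it is worth timing both on your own data.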

2. Numba

Numba allows you to speed up pure Python functions by JIT compiling them to native machine code.

In several cases, you can see significant speed improvements just by adding the @jit decorator.

import numba

@numba.jit
def plainfunc(x):
    return x * (x + 10)

That’s it. Just add @numba.jit to your functions. You can parallelize your functions as well using @numba.njit(parallel=True).

To know more, see the ‘A ~5 minute guide to Numba’ page in the Numba documentation.

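3. pd.eval and df.query

On large dataframes, your typical pandas code can get faster if you use pd.eval or df.query instead of the corresponding plain pandas expressions, because the whole expression is evaluated in one pass (using numexpr when it is installed) without creating large temporary arrays.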

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query
import pandas as pd
df = pd.DataFrame({'A': range(1, 6),
                   'B': range(10, 0, -2),
                   'C': range(10, 5, -1)})

df
   A   B   C
0  1  10  10
1  2   8   9
2  3   6   8
3  4   4   7
4  5   2   6

df.query('A > B')
   A  B  C
4  5  2  6

This is essentially the same as the following pandas code.

df[df.A > df.B]
   A  B  C
4  5  2  6

Now if you want to use a variable value that is not a column in the dataframe, you can do the following.

k = 3
df.query('A > @k')
   A  B  C
3  4  4  7
4  5  2  6

Now, pd.eval is very much similar to df.query, only that it operates as a top-level pandas function.

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval
df = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
df
  animal  age
0    dog   10
1    pig   20

pd.eval("double_age = df.age * 2", target=df)
  animal  age  double_age
0    dog   10          20
1    pig   20          40

4. Rapids cuDF

The RAPIDS cuDF library allows you to use the power of GPUs while working with a familiar pandas-like API.

Together with the wider RAPIDS ecosystem, it allows you to do typical data wrangling tasks as well as train ML models like XGBoost on GPUs.
