Pandas Dropna – How to drop missing values?

In reality, majority of the datasets collected contain missing values due to manual errors, unavailability of information, etc. Although there are different ways for handling missing values, sometimes you have no other option but to drop those rows from the dataset. A common method for dropping rows and columns is using the pandas `dropna` function.

In this article, you will learn about the different functionalities of this method for dropping rows with missing values followed by some practical tips for using pandas dropna method.

Creating a Basic DataFrame

import pandas as pd

# Create the data of the DataFrame as a dictionary
data_dict = {'Name': ['Ankit', 'Aman', 'Riya', 'Ayush', 'Anushka'],
             'ID': [4001, 4002, np.nan, np.nan, 4010], # Use np.nan to create missing values
             'Department': ['Technical', np.nan, np.nan, np.nan, 'Marketing'],
             'Qualification': [np.nan, np.nan, np.nan, np.nan, 'M.A.']}

# Create the DataFrame
df = pd.DataFrame(data_dict)

df
Creating Basic Dataframe

The pandas dropna function

  • Syntax: pandas.DataFrame.dropna(axis = 0, how =’any’, thresh = None, subset = None, inplace=False)
  • Purpose: To remove the missing values from a DataFrame.
  • Parameters:
    • axis:0 or 1 (default: 0). Specifies the orientation in which the missing values should be looked for. Pass the value 0 to this parameter search down the rows. Pass the value 1 to this parameter to look across columns.
    • how:‘any’ or ‘all’ (default:’any’). If it is set to ‘any’, the row/column that has atleast one missing value will be dropped. If it is set to ‘all’, only the rows/columns in which all values are missing will be dropped.
    • thresh:Integer (default: None). Maximum number of missing values in a row or column which will be ignored by this function.
    • subset:array (default: None). It is used to specify the particular labels along the axis which is not specified in the axis parameter in which missing  values should be looked for.
    • inplace:Boolean (default: False). Denotes if the missing values should be dropped in the original DataFrame or if a new DataFrame having the missing values dropped should be returned.
  • Returns: If inplace is set to ‘True’ then None. If it is set to ‘False’, then a DataFrame.

Dropping rows having at least 1 missing value

For removing all rows which have at least one missing value, the value of the axis parameter should be 0 and the how parameter should be set to ‘any’. Since these are the default values of the parameter, you do not need to pass any arguments to the function. This is the simplest usecase of pandas dropna function.

print('Original DataFrame:')
print(df)
print('\n')

# Default configuration drops rows having at least 1 missing value
print('DataFrame after dropping the rows having missing values:')
print(df.dropna())
Original DataFrame:
      Name      ID Department Qualification
0    Ankit  4001.0  Technical           NaN
1     Aman  4002.0        NaN           NaN
2     Riya     NaN        NaN           NaN
3    Ayush     NaN        NaN           NaN
4  Anushka  4010.0  Marketing          M.A.


DataFrame after dropping the rows having missing values:
      Name      ID Department Qualification
4  Anushka  4010.0  Marketing          M.A.

Dropping columns having at least 1 missing values

For removing all columns which have at least one missing value, pass the value 1 to the axis parameter to dropna().

print('Original DataFrame:')
print(df)
print('\n')

# Drop all columns that have at least one missing value
print('DataFrame after dropping the columns having missing values:')
print(df.dropna(axis=1))
Original DataFrame:
      Name      ID Department Qualification
0    Ankit  4001.0  Technical           NaN
1     Aman  4002.0        NaN           NaN
2     Riya     NaN        NaN           NaN
3    Ayush     NaN        NaN           NaN
4  Anushka  4010.0  Marketing          M.A.


DataFrame after dropping the columns having missing values:
      Name
0    Ankit
1     Aman
2     Riya
3    Ayush
4  Anushka

Dropping rows or columns only when all values are null

With the help of this function, you can also drop all the rows and columns where all the values are null values.

Dropping all rows where all the values are null values

data_dict = {'Name': ['Ankit', 'Aman', np.nan, 'Ayush', 'Anushka'],
             'ID': [4001, 4002, np.nan, np.nan, 4010],
             'Department': ['Technical', np.nan, np.nan, np.nan, 'Marketing'],
             'Qualification': [np.nan, np.nan, np.nan, np.nan, 'M.A.']}

df = pd.DataFrame(data_dict)

df
Pandas dropna rows of null values

To drop all the rows which contain only missing values, pass the value 0 to the axis parameter and set the value how='all'.

print('Original DataFrame:')
print(df)
print('\n')

# Drop only those rows where all the row values are null values
print('DataFrame after dropping the rows where all the values were null values:')
print(df.dropna(axis=0, how='all'))
Original DataFrame:
      Name      ID Department Qualification
0    Ankit  4001.0  Technical           NaN
1     Aman  4002.0        NaN           NaN
2     Riya     NaN        NaN           NaN
3    Ayush     NaN        NaN           NaN
4  Anushka  4010.0  Marketing          M.A.


DataFrame after dropping the rows where all the values were null values:
      Name      ID Department Qualification
0    Ankit  4001.0  Technical           NaN
1     Aman  4002.0        NaN           NaN
2     Riya     NaN        NaN           NaN
3    Ayush     NaN        NaN           NaN
4  Anushka  4010.0  Marketing          M.A.

Here, none of them contained missing values in all columns. Hence, no rows were dropped.

Dropping all columns where all the values are null values

In particular cases, you might encounter columns full of null values (information not collected). These definitely have to be dropped

data_dict = {'Name': ['Ankit', 'Aman', 'Riya', 'Ayush', 'Anushka'],
             'ID': [4001, 4002, np.nan, np.nan, 4010],
             'Department': ['Technical', np.nan, np.nan, np.nan, 'Marketing'],
             'Qualification': [np.nan, np.nan, np.nan, np.nan, np.nan]}

df = pd.DataFrame(data_dict)

df
Dropping rows of null values

For dropping all the columns which contain only missing values, pass the value 1 to the axis parameter and the value ‘all’ to the how parameter.

print('Original DataFrame:')
print(df)
print('\n')

# Drop only those columns where all values are null values
print('DataFrame after dropping the columns where all the values were null values:')
print(df.dropna(axis=1, how='all'))
Original DataFrame:
      Name      ID Department  Qualification
0    Ankit  4001.0  Technical            NaN
1     Aman  4002.0        NaN            NaN
2     Riya     NaN        NaN            NaN
3    Ayush     NaN        NaN            NaN
4  Anushka  4010.0  Marketing            NaN


DataFrame after dropping the columns where all the values were null values:
      Name      ID Department
0    Ankit  4001.0  Technical
1     Aman  4002.0        NaN
2     Riya     NaN        NaN
3    Ayush     NaN        NaN
4  Anushka  4010.0  Marketing

How to drop rows/columns that contain missing values above a certain threshold?

In certain cases, you don’t want to drop a row that has very few missing values, so pandas dropna gives you an option to set threshold. To remove only those rows or columns which have missing values above a certain threshold, you need to pass a threshold value to the thresh parameter.

The `thresh` parameter represent the number of non-missing values needed to retain the row/column.

# Original DataFrame
print('Original DataFrame:')
print(df)
print('\n')

# Keep rows that contain at least 2 non-missing values
print('DataFrame after using the thresh function:')
print(df.dropna(axis=0, thresh=2))
Original DataFrame:
      Name      ID Department  Qualification
0    Ankit  4001.0  Technical            NaN
1     Aman  4002.0        NaN            NaN
2     Riya     NaN        NaN            NaN
3    Ayush     NaN        NaN            NaN
4  Anushka  4010.0  Marketing            NaN


DataFrame after using the thresh function:
      Name      ID Department  Qualification
0    Ankit  4001.0  Technical            NaN
1     Aman  4002.0        NaN            NaN
4  Anushka  4010.0  Marketing            NaN

Dropping rows if missing values are present only in specific columns

DataFrame.dropna() also gives you the option to remove the rows by searching for null or missing values on specified columns.

To search for null values in specific columns, pass the column names to the subset parameter. It can take a list of column names or column positions.

print('Original DataFrame:')
print(df)
print('\n')

# Drop only those rows where the specified column has a missing value
print('DataFrame after using the subset function:')
print(df.dropna(subset=['ID']))
Original DataFrame:
      Name      ID Department  Qualification
0    Ankit  4001.0  Technical            NaN
1     Aman  4002.0        NaN            NaN
2     Riya     NaN        NaN            NaN
3    Ayush     NaN        NaN            NaN
4  Anushka  4010.0  Marketing            NaN


DataFrame after using the subset function:
      Name      ID Department  Qualification
0    Ankit  4001.0  Technical            NaN
1     Aman  4002.0        NaN            NaN
4  Anushka  4010.0  Marketing            NaN

Let’s look at the column names.

df.columns
Index(['Name', 'ID', 'Department', 'Qualification'], dtype='object')

Practical Tips

  1. In case of memory constraints, use the inplace parameter. Set its value as True so that the changes will take place in the original DataFrame itself and a new DataFrame will not be created.
  2. Please keep in mind that while dropping rows or columns using the how parameter and setting its value as ‘all’ , the function will remove only those labels where all its values are missing or null values.
  3. While removing columns, you can also pass row labels to the subset parameter to search for rows that contain missing values.

Example:

Want to become awesome in ML?

Hi! I am Selva, and I am excited you are reading this!
You can now go from a complete beginner to a Data Science expert, with my end-to-end free Data Science training.
No shifting between multiple books and courses. Hop on to the most effective way to becoming the expert. (Includes downloadable notebooks, portfolio projects and exercises)


Start free with the first course 'Foundations of Machine Learning' - a well rounded orientation of what the field of ML is all about.

Enroll to the Foundations of ML Course (FREE)

Sold already? Start with the Complete ML Mastery Path

print('Original DataFrame')
print(df)

print('\n')

print('DataFrame after removing columns using the subset function:')
print(df.dropna(axis=1, subset=[2, 3]))
Original DataFrame
      Name      ID Department Qualification
0    Ankit  4001.0  Technical           NaN
1     Aman  4002.0        NaN           NaN
2     Riya     NaN        NaN           NaN
3    Ayush     NaN        NaN           NaN
4  Anushka  4010.0  Marketing          M.A.


DataFrame after removing columns using the subset function:
      Name
0    Ankit
1     Aman
2     Riya
3    Ayush
4  Anushka

Conclusion

In this article, you learned about pandas dropna using the DataFrame.dropna() and using its various parameters such as subset, how and thresh. If you would like to learn more about various functions of pandas library, checkout 101 Pandas Exercises for Data Analysis.

We have also have the most comprehensive Pandas for Data Science course that covers Pandas in depth.

Test Your Knowledge

Q1: The default configuration of DataFrame.dropna() removes all the rows having missing values from the DataFrame. True or False?

Answer

Answer: True

Q2: Which parameter is used to specify the row or column labels to be included while removing the missing value?

Answer

Answer: The subset function.

Q3: Write the code to drop the rows that have more than two missing values from the DataFrame df.

Answer

Answer: df.dropna(axis=0,thresh=2)

Q4: Write the code to remove only those columns from the DataFrame that contain only null values.

Answer

Answer: df.dropna(axis=1,how='all')

Q5: Write the code to remove rows from the DataFrame df especially in those rows where the value of the column ‘col_3’ is null.

Answer

Answer: DataFrame.dropna(axis=0,subset=['col_3'])

The article was contributed by Shreyansh B and Shri Varsheni

Course Preview

Machine Learning A-Z™: Hands-On Python & R In Data Science

Free Sample Videos:

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science