Pandas DataFrame.duplicated()

The pandas.DataFrame.duplicated() method is used to find duplicate rows in a DataFrame. It returns a Boolean Series that marks each row as a duplicate (True) or not a duplicate (False).

In this article, you will learn how to use this method to identify the duplicate rows in a DataFrame. You will also get to know a few practical tips for using this method.

Creating a DataFrame

# Create a DataFrame
import pandas as pd
data_df = {'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],

           'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                               'Full-time Employee', 'Part-time Employee', 'Part-time Employee', 'Full-time Employee'],

           'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                          'Administration', 'Technical', 'Marketing', 'Administration']}

df = pd.DataFrame(data_df)
df

Also read: creating and loading DataFrames.

pandas.DataFrame.duplicated()

  • Syntax: pandas.DataFrame.duplicated(subset=None, keep='first')
  • Purpose: To identify duplicate rows in a DataFrame
  • Parameters:
    • subset: a column label or a sequence of labels (default: None). It specifies the columns to compare when searching for duplicates. With None, all the columns are compared.
    • keep: 'first', 'last' or False (default: 'first'). It specifies which instance of a repeated row is treated as unique (marked False). With False, every instance of a repeated row is marked as a duplicate.
  • Returns: A Boolean Series where the value True indicates that the row at the corresponding index is a duplicate and False indicates that it is not.

Using the DataFrame.duplicated() method

When you call DataFrame.duplicated() without any arguments, the default parameter values are used: all the columns are compared, and the first instance of each repeated row is treated as unique.

# Use the DataFrame.duplicated() method to return a series of boolean values
bool_series = df.duplicated()
print(bool_series)
0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool
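
Because the result is a Boolean Series, it can also be summarised rather than read row by row. The snippet below is a minimal sketch that rebuilds the same sample DataFrame and counts the duplicates; the variable names n_duplicates and duplicate_labels are illustrative and not part of the original article.

```python
import pandas as pd

# Same sample data as the article's DataFrame
df = pd.DataFrame({
    'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],
    'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                        'Full-time Employee', 'Part-time Employee', 'Part-time Employee',
                        'Full-time Employee'],
    'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                   'Administration', 'Technical', 'Marketing', 'Administration'],
})

bool_series = df.duplicated()

# True counts as 1 and False as 0, so the sum is the number of duplicate rows
n_duplicates = int(bool_series.sum())

# Index labels of the rows flagged as duplicates
duplicate_labels = bool_series[bool_series].index.tolist()

print(n_duplicates)      # 2
print(duplicate_labels)  # [4, 6]
```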

Using the keep parameter

You can use the keep parameter to specify which instance of a repeated row should be considered unique. The remaining instances will be considered duplicates.

Setting keep as ‘first’

The default value of the keep parameter is ‘first’. It means that the method will consider the first instance of a row to be unique and the remaining instances to be duplicates.

Let’s try to remove the duplicated rows.

# Use the keep parameter to consider only the first instance of a duplicate row to be unique
bool_series = df.duplicated(keep='first')
print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after keeping only the first instance of the duplicate rows:')

# The `~` sign is used for negation. It changes the boolean value True to False and False to True.

df[~bool_series]
Boolean series:
0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool


DataFrame after keeping only the first instance of the duplicate rows:

As you can see, the fifth and the seventh rows (index 4 and 6) have been identified as duplicates. The fifth row is a duplicate of the first row and the seventh row is a duplicate of the second row. Hence, they have been removed from the DataFrame.

Setting keep as ‘last’

When you set the value of this parameter as ‘last’, the method will consider the last instance of a row to be unique and the remaining instances to be duplicates.

Let’s try to remove the duplicated rows.

# Use the keep parameter to consider only the last instance of a duplicate row to be unique
bool_series = df.duplicated(keep='last')
print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after keeping only the last instance of the duplicate rows:')

# The `~` sign is used for negation. It changes the boolean value True to False and False to True
df[~bool_series]
Boolean series:
0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
dtype: bool


DataFrame after keeping only the last instance of the duplicate rows:

Here, the first and the second rows have been identified as duplicates while the fifth and the seventh rows have been considered to be unique.
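
To make the difference between the two settings concrete, the following sketch rebuilds the same sample DataFrame and compares which index labels survive the filtering under each value of keep. The names kept_first and kept_last are illustrative only.

```python
import pandas as pd

# Same sample data as the article's DataFrame
df = pd.DataFrame({
    'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],
    'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                        'Full-time Employee', 'Part-time Employee', 'Part-time Employee',
                        'Full-time Employee'],
    'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                   'Administration', 'Technical', 'Marketing', 'Administration'],
})

# keep='first' retains the earliest copy of each repeated row
kept_first = df[~df.duplicated(keep='first')].index.tolist()

# keep='last' retains the latest copy of each repeated row
kept_last = df[~df.duplicated(keep='last')].index.tolist()

print(kept_first)  # [0, 1, 2, 3, 5, 7]
print(kept_last)   # [2, 3, 4, 5, 6, 7]
```

Both results contain the same six distinct rows. Only the index labels of the retained copies differ.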

Setting keep as False

If you set the value of keep as the boolean value False, then the method will consider all the instances of a repeated row to be duplicates, including the first one.

Let’s try to remove the duplicated rows.

# Use the keep parameter to consider all instances of a row to be duplicates
bool_series = df.duplicated(keep=False)
print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after removing all the instances of the duplicate rows:')
# The `~` sign is used for negation. It changes the boolean value True to False and False to True
df[~bool_series]
Boolean series:
0     True
1     True
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool


DataFrame after removing all the instances of the duplicate rows:
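
Setting keep as False is also handy for inspecting every copy of a repeated row side by side, without negating the mask. The sketch below rebuilds the same sample DataFrame; sorting by Name to group the copies together, and the names all_copies and only_unique, are illustrative choices rather than part of the original article.

```python
import pandas as pd

# Same sample data as the article's DataFrame
df = pd.DataFrame({
    'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],
    'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                        'Full-time Employee', 'Part-time Employee', 'Part-time Employee',
                        'Full-time Employee'],
    'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                   'Administration', 'Technical', 'Marketing', 'Administration'],
})

mask = df.duplicated(keep=False)

# Every copy of a repeated row, grouped together by Name
# (mergesort is a stable sort, so copies keep their original relative order)
all_copies = df[mask].sort_values('Name', kind='mergesort')

# Rows that occur exactly once in the DataFrame
only_unique = df[~mask]

print(all_copies.index.tolist())   # [0, 4, 1, 6]
print(only_unique.index.tolist())  # [2, 3, 5, 7]
```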

Using the subset parameter

The subset parameter is used for specifying the columns in which duplicates are to be searched. After you have specified the columns, the method will search for duplicate rows by comparing only the values of the specified columns between the rows.

This is extremely useful, as you might be interested in finding duplicate values in only a few columns.

# Use the subset parameter to search for duplicate values only in the Name column of the DataFrame

bool_series = df.duplicated(subset='Name')

print('Boolean series:')
print(bool_series)
print('\n')
print('DataFrame after removing duplicates found in the Name column:')
df[~bool_series]
Boolean series:
0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool


DataFrame after removing duplicates found in the Name column:
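
In the sample data, the Name column happens to give the same result as comparing whole rows. As a sketch of how the choice of column changes the outcome, the snippet below rebuilds the same DataFrame and uses the Department column instead; the names by_department and kept are illustrative only.

```python
import pandas as pd

# Same sample data as the article's DataFrame
df = pd.DataFrame({
    'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],
    'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                        'Full-time Employee', 'Part-time Employee', 'Part-time Employee',
                        'Full-time Employee'],
    'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                   'Administration', 'Technical', 'Marketing', 'Administration'],
})

# Only the Department values are compared; the other columns are ignored
by_department = df.duplicated(subset='Department')

# Only the first row seen for each department is retained
kept = df[~by_department]

print(by_department.tolist())
# [False, False, False, True, True, True, True, True]
print(kept.index.tolist())
# [0, 1, 2]
```

Here five rows are flagged, because only three distinct departments exist, even though most of those rows differ in the Name column.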

Practical Tips

  1. If you do not use the subset parameter, then all the values in two rows need to be the same for a row to be identified as a duplicate.
  2. You can also pass multiple columns to the subset parameter. However, keep in mind that the values of all the specified columns must be the same in two rows for them to be considered duplicates.
bool_series = df.duplicated(subset=['Name', 'Department'])
print(bool_series)
print('\n')
print('DataFrame after removing the duplicate rows found in the "Name" and "Department" columns:')

df[~bool_series]
0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool


DataFrame after removing the duplicate rows found in the "Name" and "Department" columns:
  3. Although this method returns a Series which only identifies the duplicate rows in a DataFrame, you can use this Series to subset the DataFrame so that it contains only unique rows.
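
As a brief aside on the last tip, pandas also provides a DataFrame.drop_duplicates() method that accepts the same subset and keep parameters. The sketch below rebuilds the same sample DataFrame and checks that filtering with the negated mask gives the same result as this shortcut; the names mask_based and shortcut are illustrative only.

```python
import pandas as pd

# Same sample data as the article's DataFrame
df = pd.DataFrame({
    'Name': ['Arpit', 'Riya', 'Priyanka', 'Aman', 'Arpit', 'Rohan', 'Riya', 'Sakshi'],
    'Employment Type': ['Full-time Employee', 'Part-time Employee', 'Intern', 'Intern',
                        'Full-time Employee', 'Part-time Employee', 'Part-time Employee',
                        'Full-time Employee'],
    'Department': ['Administration', 'Marketing', 'Technical', 'Marketing',
                   'Administration', 'Technical', 'Marketing', 'Administration'],
})

cols = ['Name', 'Department']

# Subset the DataFrame with the negated Boolean Series
mask_based = df[~df.duplicated(subset=cols, keep='first')]

# Let pandas identify and drop the duplicates in one call
shortcut = df.drop_duplicates(subset=cols, keep='first')

print(mask_based.equals(shortcut))  # True
print(shortcut.index.tolist())      # [0, 1, 2, 3, 5, 7]
```

The mask-based form remains useful when you want to inspect or count the duplicates before deciding whether to drop them.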

Test Your Knowledge

Q1: The False value for the keep parameter is used to remove all the duplicate rows from the DataFrame. True or False?

Answer: False. Setting keep as False marks all the instances of a repeated row as duplicates, but it does not remove them.

Q2: How are the duplicate rows identified when multiple columns are passed to the subset parameter?

Answer: When multiple columns are passed to the subset parameter, the method will consider a row to be a duplicate only if the values of all the specified columns in that row match the values of the specified columns in another row.

Q3: Write the code to remove all the instances of duplicate rows from the DataFrame df apart from the first instance of the rows.

Answer:

bool_series = df.duplicated(keep='first')

df[~bool_series]

Q4: Write the code to remove all the instances of duplicate rows from the DataFrame df apart from the last instance of the rows.

Answer:

bool_series = df.duplicated(keep='last')

df[~bool_series]

Q5: Write the code to search for duplicate values in the columns col_1 and col_2 in the DataFrame df.

Answer: df.duplicated(subset=['col_1', 'col_2'])

The article was contributed by Shreyansh B and Shri Varsheni
