RegEx Replace values using Pandas

RegEx (Regular Expression) is a special sequence of characters that forms a search pattern using a specialized syntax.

When manipulating data, especially text, you often need to work with specific string patterns, such as retrieving hashtags from a tweet, extracting dates from a text, or removing website links. The pandas replace() function replaces values in a dataframe or series; the values to replace can be given as a string, regex, list, dictionary, series, or number. In this article, we explain how to replace patterns using regex, with examples.

Replace function for regex

To use the pandas replace function with regex, you need to set three parameters: to_replace, regex, and value.

  1. to_replace: The value to be replaced in the dataframe or series. For regular expressions, pass a regex pattern here. The pattern describes a generic sequence of characters.
  2. regex: Set this to True so that pandas interprets to_replace as a regular expression.
  3. value: The value that is substituted for every match of to_replace.
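
As a minimal sketch of how the three parameters interact, the snippet below uses a small made-up Series (not the article's dataset) and compares the default literal matching with regex=True.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
s = pd.Series(['#python', 'I like #python'])

# regex=False (the default): to_replace must equal the whole cell value
literal = s.replace(to_replace='#python', value='', regex=False)

# regex=True: to_replace is a pattern, and every match inside a cell is replaced
pattern = s.replace(to_replace=r'#\w+', value='', regex=True)

print(literal.tolist())  # ['', 'I like #python']
print(pattern.tolist())  # ['', 'I like ']
```

Without regex=True, the second value is left untouched because the cell as a whole is not equal to '#python'.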

If you are hearing of regex for the first time, we have a beginner tutorial to get you up to speed.

Let’s try to implement this using various use cases.

Create a sample dataset

Create a pandas dataframe with sample data as shown below. After that, we'll look at various examples of pandas replace using regex.


# Import packages
import pandas as pd

df = pd.DataFrame(
    data=[
        ['@mlplus', 'We are excited to launch our new course on ML. #newcourse #machinelearning #python', 'mlplus@mlplus.tech'],
        ['@kaustubhgupta', "@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail", 'kaustubh@random.in'],
        ['@rajveer', 'Excited to lauch our new product! #newproduct #startup ', 'rajveer@twitter.me'],
        ['@joe', 'When will this coronavirus end? #thoughts', 'joe@facebook.pl'],
        ['@abhishek', 'I want to become web developer. Any tips? @webdeveloper @randomxyz', 'abhishek@orkut.tech'],
        ['@ayushi', 'Missing college! @colllege', 'ayushi@space.org'],
    ],
    columns=['twitter_username', 'tweet', 'email']
)
df

Situation 1: Removing hashtags using regex replace

The dataset above has a tweet column. The values in this column contain hashtags, which are generally used for cross-referencing content. What if you want to remove all the hashtags from the tweets?

Use the pandas replace function with regex. The regex for this case would be #\w+.

Tweet before replacement

df.tweet[0]

Output:

'We are excited to launch our new course on ML. #newcourse #machinelearning #python'

Tweet after replacement

# using replace function with regex pattern, regex=True and value as empty string
df.tweet.replace(to_replace=r'#\w+', regex=True, value='')[0]

Output:

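'We are excited to launch our new course on ML.   '

The spaces that preceded each hashtag are left in place, which is why the result ends with trailing whitespace.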
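
The pattern #\w+ removes only the hashtag itself, so the whitespace in front of each hashtag stays in the result. A minimal sketch of one way to remove it as well, assuming you want that whitespace gone: the widened pattern and the extra .str.strip() call are illustrative choices, and the Series below reuses two tweets from the sample data.

```python
import pandas as pd

tweets = pd.Series([
    'We are excited to launch our new course on ML. #newcourse #machinelearning #python',
    'When will this coronavirus end? #thoughts',
])

# \s* also consumes the whitespace in front of each hashtag;
# .str.strip() removes anything left at either end of the string
cleaned = tweets.replace(to_replace=r'\s*#\w+', value='', regex=True).str.strip()

print(cleaned.tolist())
# ['We are excited to launch our new course on ML.', 'When will this coronavirus end?']
```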


Situation 2: Replacing all domain suffixes with .edu using regex

Suppose you want to replace every domain suffix, such as .com, .in, or .tech, with .edu in the email column of the dataset. The regex pattern for this case will be \.\w+, which matches a literal dot followed by one or more word characters.

Emails before replacement

df.email

Output:

0     mlplus@mlplus.tech
1     kaustubh@random.in
2     rajveer@twitter.me
3        joe@facebook.pl
4    abhishek@orkut.tech
5       ayushi@space.org
Name: email, dtype: object

Emails after replacement

df.email.replace(to_replace=r'\.\w+', value='.edu', regex=True)

Output:

0      mlplus@mlplus.edu
1    kaustubh@random.edu
2    rajveer@twitter.edu
3       joe@facebook.edu
4     abhishek@orkut.edu
5       ayushi@space.edu
Name: email, dtype: object
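
The pattern \.\w+ matches every dot followed by word characters, not only the final one. That is fine for the sample data, but an address with a dot before the @ would be altered too. A minimal sketch of a safer variant, assuming you only want the last suffix changed: the $ anchor is an addition to the article's pattern, and the addresses below are hypothetical.

```python
import pandas as pd

# Hypothetical addresses; the first has a dot in the part before the @
emails = pd.Series(['first.last@example.com', 'joe@facebook.pl'])

# Unanchored: every ".word" is replaced, including ".last"
unanchored = emails.replace(to_replace=r'\.\w+', value='.edu', regex=True)

# Anchored with $: only the ".word" at the end of the string is replaced
anchored = emails.replace(to_replace=r'\.\w+$', value='.edu', regex=True)

print(unanchored.tolist())  # ['first.edu@example.edu', 'joe@facebook.edu']
print(anchored.tolist())    # ['first.last@example.edu', 'joe@facebook.edu']
```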

Situation 3: Replace all the vowels in tweets with $

In this case, every vowel will be replaced with $. For example, the word Miss would become M$ss.

The regular expression for this case will be: [aeiouAEIOU]

Tweet before replacement

df.tweet[5]

Output:

'Missing college! @colllege'

Tweet after replacement

df.tweet.replace(to_replace="[aeiouAEIOU]", regex=True, value='$')[5]

Output:

'M$ss$ng c$ll$g$! @c$lll$g$'
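
Listing both cases in the character class works, but the same effect can be had with the inline flag (?i) from Python's re syntax. The sketch below is an alternative spelling, not the article's pattern; the second tweet is a made-up value that starts with an uppercase vowel.

```python
import pandas as pd

tweets = pd.Series(['Missing college! @colllege', 'Any tips?'])

# (?i) makes the match case-insensitive, so [aeiou] also covers AEIOU
masked = tweets.replace(to_replace=r'(?i)[aeiou]', value='$', regex=True)

print(masked.tolist())  # ['M$ss$ng c$ll$g$! @c$lll$g$', '$ny t$ps?']
```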

Practical Tips

  1. Regular expressions come in handy for replacing complex string patterns that are difficult to handle with other functions.
  2. For instance, you can replace all the cuss words in a text with special characters using regex replacement.
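
The word-masking idea from the tips above can be sketched as follows. The word list, the sample comments, and the '****' mask are all hypothetical placeholders; a real filter would need a properly maintained list.

```python
import re
import pandas as pd

# Hypothetical word list and comments, for illustration only
blocked_words = ['darn', 'heck']
comments = pd.Series(['What the heck is this?', 'Darn, it broke again', 'All good here'])

# \b restricts matches to whole words, re.escape guards against special
# characters in the word list, and (?i) ignores case
pattern = r'(?i)\b(?:' + '|'.join(re.escape(w) for w in blocked_words) + r')\b'

masked = comments.replace(to_replace=pattern, value='****', regex=True)

print(masked.tolist())
# ['What the **** is this?', '****, it broke again', 'All good here']
```

Building the pattern from a list keeps the words in one place, so the list can grow without touching the replace call.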

 

Test your knowledge

Q1: To enable regular expression search in the replace function, which parameter should be set?

Answer: The regex parameter should be set to True.

Q2: The value parameter in replace function is used for:

A) defining which values should be replaced in the string.

B) defining the replacement value.

C) defining the regex pattern.

D) None of these.

Answer: Option (B)

Q3: Consider the dataframe below:

import pandas as pd

df = pd.DataFrame(
    data=[
        ['@mlplus', 'Our new course on ML price: 3222'],
        ['@kaustubhgupta', "Gmail down for 30 minutes. What's the matter?"],
        ['@rajveer', 'Excited to lauch our new product on 5th Jan!'],
        ['@joe', 'Will coronavirus end in 2021? #thoughts'],
        ['@abhishek', 'I want to become web developer in 4 months. Any tips? @webdeveloper @randomxyz'],
        ['@ayushi', 'Missing college! @colllege'],
    ],
    columns=['username', 'tweet']
)

df

Write the code to replace the numbers in the tweets with the text 00number00, using the replace function and a regular expression.

Answer: Use the regular expression \d+, which matches one or more consecutive digits.

df.tweet.replace(to_replace=r"\d+", value='00number00', regex=True)

Output:

0               Our new course on ML price: 00number00
1    Gmail down for 00number00 minutes. What's the ...
2    Excited to lauch our new product on 00number00...
3        Will coronavirus end in 00number00? #thoughts
4    I want to become web developer in 00number00 m...
5                           Missing college! @colllege
Name: tweet, dtype: object

 

The article was contributed by Kaustubh G and Shri Varsheni.
