A regular expression (regex) is a sequence of characters that defines a search pattern using a specialized syntax.
When manipulating data, especially text, you often need to work with specific string patterns, such as retrieving hashtags from a tweet, extracting dates from a text, or removing website links. The pandas replace() function replaces values in a DataFrame or Series; the value to be replaced can be given as a string, regex, list, dictionary, Series, or number. In this article, we explain how to replace patterns using regex, with examples.
Replace function for regex
To use the pandas replace function with regex, you need to define three parameters: to_replace, regex, and value.
to_replace: Denotes the value that has to be replaced in the dataframe or series. In the case of regular expressions, a regex pattern has to be passed. This pattern represents a generic sequence of characters.
regex: For pandas to interpret the replacement as a regular expression replacement, set it to True.
value: The value that is substituted for each match of to_replace.
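As a minimal sketch of how the three parameters fit together (the Series s and its values are made up for illustration), the snippet below masks every run of digits. It also shows that without regex=True the same call looks for cells whose entire value equals the literal string \d+, so nothing changes.

```python
import pandas as pd

# Hypothetical sample data, not part of the article's dataset
s = pd.Series(["order-123", "order-456", "no digits"])

# regex=True: to_replace is treated as a pattern and each match is substituted
masked = s.replace(to_replace=r"\d+", value="N", regex=True)

# regex left at its default (False): to_replace is compared against whole cell values
unchanged = s.replace(to_replace=r"\d+", value="N")

print(masked.tolist())     # ['order-N', 'order-N', 'no digits']
print(unchanged.tolist())  # ['order-123', 'order-456', 'no digits']
```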
If you are hearing of regex for the first time, we have a beginner tutorial to get you up to speed.
Let’s try to implement this using various use cases.
Create a sample dataset
Create a pandas dataframe with sample data as shown below. Following that, we’ll look at various examples of pandas replace using regex.
# Import packages
import pandas as pd
df = pd.DataFrame(
data= [
['@mlplus', 'We are excited to launch our new course on ML. #newcourse #machinelearning #python','mlplus@mlplus.tech'],
['@kaustubhgupta', "@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail",'kaustubh@random.in'],
['@rajveer', 'Excited to lauch our new product! #newproduct #startup ','rajveer@twitter.me'],
['@joe', 'When will this coronavirus end? #thoughts','joe@facebook.pl'],
['@abhishek', 'I want to become web developer. Any tips? @webdeveloper @randomxyz','abhishek@orkut.tech'],
['@ayushi', 'Missing college! @colllege','ayushi@space.org' ]
],
columns=['twitter_username', 'tweet', 'email']
)
df

Situation 1: Removing hashtags using regex replace
The dataset above has a tweet column. The values in this column contain hashtags, which are generally used for cross-referencing content. What if you want to remove all the hashtags from the tweets?
Use the pandas replace function with regex. The regex for this case would be #\w+, which matches a # followed by one or more word characters.
Tweet before replacement
df.tweet[0]
Output:
'We are excited to launch our new course on ML. #newcourse #machinelearning #python'
Tweet after replacement
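# Use replace with a regex pattern, regex=True, and an empty string as the value
df.tweet.replace(to_replace=r'#\w+', regex=True, value='')[0]
Output:
'We are excited to launch our new course on ML.   '
Only the hashtags themselves are removed, so the spaces that preceded them remain at the end of the string.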
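The plain #\w+ pattern removes only the hashtag text, so the whitespace around each hashtag stays behind. One possible way to tidy that up, sketched here on a two-row Series copied from the sample tweets (the variable names are illustrative), is to let the pattern also consume leading whitespace and then strip whatever remains at the ends.

```python
import pandas as pd

# Two tweets taken from the sample dataset above
tweets = pd.Series([
    "We are excited to launch our new course on ML. #newcourse #machinelearning #python",
    "When will this coronavirus end? #thoughts",
])

# \s* also removes the whitespace in front of each hashtag;
# .str.strip() trims any whitespace left at either end of the string
cleaned = tweets.replace(to_replace=r"\s*#\w+", value="", regex=True).str.strip()

print(cleaned.tolist())
# ['We are excited to launch our new course on ML.', 'When will this coronavirus end?']
```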
Situation 2: Replacing all domain suffixes with .edu using regex
Suppose you want to replace all the domain suffixes such as .com, .in, .tech, etc. with .edu in the email column of the dataset. The regex pattern for this case will be \.\w+, which matches a dot followed by one or more word characters.
Emails before replacement
df.email
Output:
0 mlplus@mlplus.tech
1 kaustubh@random.in
2 rajveer@twitter.me
3 joe@facebook.pl
4 abhishek@orkut.tech
5 ayushi@space.org
Name: email, dtype: object
Emails after replacement
df.email.replace(to_replace=r'\.\w+', value='.edu', regex=True)
Output:
0 mlplus@mlplus.edu
1 kaustubh@random.edu
2 rajveer@twitter.edu
3 joe@facebook.edu
4 abhishek@orkut.edu
5 ayushi@space.edu
Name: email, dtype: object
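The pattern \.\w+ works for the sample emails because each address contains exactly one dot. For addresses with dots in the user name or in a subdomain, every dot-prefixed segment would be replaced. A hedged sketch of one alternative, using an invented address to show the difference, anchors the pattern to the end of the string with $.

```python
import pandas as pd

# The first address is hypothetical; the second comes from the sample dataset
emails = pd.Series(["first.last@mail.example.com", "joe@facebook.pl"])

# Unanchored: every ".segment" in the address is replaced
all_segments = emails.replace(to_replace=r"\.\w+", value=".edu", regex=True)

# Anchored with $: only the final suffix is replaced
last_only = emails.replace(to_replace=r"\.\w+$", value=".edu", regex=True)

print(all_segments.tolist())  # ['first.edu@mail.edu.edu', 'joe@facebook.edu']
print(last_only.tolist())     # ['first.last@mail.example.edu', 'joe@facebook.edu']
```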
Situation 3: Replace all the vowels in tweets with $
In this case, the vowels will be replaced with $. For example, the word Miss would become M$ss.
The regular expression for this case will be: [aeiouAEIOU]
Tweet before replacement
df.tweet[5]
Output:
'Missing college! @colllege'
Tweet after replacement
df.tweet.replace(to_replace="[aeiouAEIOU]", regex=True, value='$')[5]
Output:
'M$ss$ng c$ll$g$! @c$lll$g$'
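Listing both cases in the character class is one way to catch upper- and lower-case vowels. Since pandas performs regex substitution with Python's re.sub, an alternative sketch is to use the inline (?i) flag for case-insensitive matching. The second value in the Series below is invented to show the upper-case behaviour.

```python
import pandas as pd

# First value comes from the sample dataset; "APPLE" is a made-up example
words = pd.Series(["Missing college! @colllege", "APPLE"])

# (?i) makes the whole pattern case-insensitive, so [aeiou] also matches AEIOU
result = words.replace(to_replace=r"(?i)[aeiou]", value="$", regex=True)

print(result.tolist())  # ['M$ss$ng c$ll$g$! @c$lll$g$', '$PPL$']
```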
Practical Tips
- Regular expressions come in handy for replacing complex string patterns that are difficult to handle with plain string functions.
- For instance, you can replace all the cuss words in a text with special characters using regex replacement.
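The word-masking tip above can be sketched roughly as follows. The blocked word list, the sample comments, and the **** mask are all hypothetical placeholders. The pattern is built by escaping each word with re.escape and wrapping the alternatives in word boundaries so that only whole words are masked.

```python
import re
import pandas as pd

# Hypothetical list of words to mask and made-up sample text
blocked = ["darn", "heck"]
comments = pd.Series(["Well, heck!", "That darn bug", "HECK no", "All fine"])

# Build (?i)\b(?:darn|heck)\b : case-insensitive, whole words only
pattern = r"(?i)\b(?:" + "|".join(re.escape(word) for word in blocked) + r")\b"

masked = comments.replace(to_replace=pattern, value="****", regex=True)

print(masked.tolist())
# ['Well, ****!', 'That **** bug', '**** no', 'All fine']
```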
Test your knowledge
Q1: To enable regular expression search in the replace function, which parameter should be set?
Answer: The regex parameter should be set to True.
Q2: The value parameter in the replace function is used for:
a) defining which values should be replaced in the string.
b) defining the replacement value.
c) defining the regex pattern.
d) None of these
Answer: (b)
Q3: Consider the dataframe below:
import pandas as pd
df = pd.DataFrame(
data= [
['@mlplus', 'Our new course on ML price: 3222'],
['@kaustubhgupta', "Gmail down for 30 minutes. What's the matter?"],
['@rajveer', 'Excited to lauch our new product on 5th Jan!'],
['@joe', 'Will coronavirus end in 2021? #thoughts'],
['@abhishek', 'I want to become web developer in 4 months. Any tips? @webdeveloper @randomxyz'],
['@ayushi', 'Missing college! @colllege']
],
columns=['username', 'tweet']
)
df

Write the code to replace the numbers in the tweets with the text 00number00 using the replace function and a regular expression.
Answer: Use the regular expression \d+, which matches one or more consecutive digits.
df.tweet.replace(to_replace=r"\d+", value='00number00', regex=True)
Output:
0 Our new course on ML price: 00number00
1 Gmail down for 00number00 minutes. What's the ...
2 Excited to lauch our new product on 00number00...
3 Will coronavirus end in 00number00? #thoughts
4 I want to become web developer in 00number00 m...
5 Missing college! @colllege
Name: tweet, dtype: object
The article was contributed by Kaustubh G and Shri Varsheni.