RegEx (Regular Expression) is a special sequence of characters that forms a search pattern using a specialized syntax.
While working on data manipulation, especially with textual data, you often need to manipulate specific string patterns. These may include retrieving hashtags from a tweet, extracting dates from a text, or removing website links. The pandas replace() function replaces values in a dataframe or series; the value to replace can be given as a string, regex, list, dictionary, series, or number. In this article, we explain how to replace patterns using regex, with examples.
Replace function for regex
To use the pandas replace function with regex, you need to define 3 parameters:
to_replace: Denotes the value that has to be replaced in the dataframe or series. In the case of regular expressions, a regex pattern has to be passed. This pattern represents a generic sequence of characters.
regex: For pandas to interpret the replacement as a regular expression replacement, set it to True.
value: This represents the value to be substituted in place of each match of to_replace.
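The three parameters can be seen working together in a minimal sketch. The series below and the pattern colou?r are made up for illustration and are not part of the article's dataset. The second call leaves regex at its default of False, in which case pandas looks for cells equal to the literal text of the pattern and finds none.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the three parameters
s = pd.Series(["colour chart", "color wheel", "plain"])

# to_replace holds the pattern, value holds the replacement,
# and regex=True tells pandas to treat to_replace as a regular expression
with_regex = s.replace(to_replace=r"colou?r", value="color", regex=True)

# With regex left at its default (False), the pattern is compared as a
# literal cell value, so nothing matches and the series is unchanged
without_regex = s.replace(to_replace=r"colou?r", value="color")

print(with_regex.tolist())
print(without_regex.tolist())
```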
If you are hearing of regex for the first time, we have a beginner tutorial to get you up to speed.
Let’s try to implement this using various use cases.
Create a sample dataset
Create a pandas dataframe with sample data as shown below. Following that, we’ll see various examples of pandas replace using regex.
import pandas as pd

df = pd.DataFrame(
    data=[
        ['@mlplus', 'We are excited to launch our new course on ML. #newcourse #machinelearning #python', 'email@example.com'],
        ['@kaustubhgupta', "@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail", 'firstname.lastname@example.org'],
        ['@rajveer', 'Excited to lauch our new product! #newproduct #startup ', 'email@example.com'],
        ['@joe', 'When will this coronavirus end? #thoughts', 'firstname.lastname@example.org'],
        ['@abhishek', 'I want to become web developer. Any tips? @webdeveloper @randomxyz', 'email@example.com'],
        ['@ayushi', 'Missing college! @colllege', 'firstname.lastname@example.org'],
    ],
    columns=['twitter_username', 'tweet', 'email']
)
df
Situation 1: Removing hashtags using regex replace
The dataset above has a tweet column. The values in this column contain hashtags, which are generally used for cross-referencing content. What if you want to remove all the hashtags from tweets?
Use the pandas replace function with regex. The regex for this case would be #\w+, which matches a # followed by one or more word characters.
Tweet before replacement
'We are excited to launch our new course on ML. #newcourse #machinelearning #python'
Tweet after replacement
# using replace function with regex pattern, regex=True and value as empty string
# (a raw string keeps Python from treating \w as a string escape)
df.tweet.replace(to_replace=r'#\w+', regex=True, value='')
'We are excited to launch our new course on ML.'
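One detail worth checking: the pattern #\w+ removes only the hashtag itself, so the spaces that separated the hashtags stay in the string. The sketch below compares it with a variant, \s*#\w+, that also consumes the whitespace in front of each hashtag. The variant is an assumption about what a tidy result should look like, not something the original example uses.

```python
import pandas as pd

tweet = pd.Series(
    ["We are excited to launch our new course on ML. #newcourse #machinelearning #python"]
)

# Pattern from the example above: removes the hashtags, keeps the separating spaces
basic = tweet.replace(to_replace=r"#\w+", value="", regex=True)

# Variant (an assumption, not from the article): also remove whitespace before each hashtag
tidy = tweet.replace(to_replace=r"\s*#\w+", value="", regex=True)

print(repr(basic[0]))
print(repr(tidy[0]))
```

Calling .str.strip() on the result of the first version would be another way to drop the trailing spaces.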
Situation 2: Replacing all domain suffixes with .edu using regex
Suppose you want to replace all the domain suffixes such as .com, .in, .tech, etc. with .edu in the email column of the dataset. The regex pattern for this case will be \.\w+, which matches a literal dot followed by one or more word characters.
Emails before replacement
0    [email protected].tech
1    [email protected].in
2    [email protected].me
3    [email protected].pl
4    [email protected].tech
5    [email protected].org
Name: email, dtype: object
Emails after replacement
df.email.replace(to_replace=r'\.\w+', value='.edu', regex=True)
0    [email protected].edu
1    [email protected].edu
2    [email protected].edu
3    [email protected].edu
4    [email protected].edu
5    [email protected].edu
Name: email, dtype: object
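Note that \.\w+ matches every dot followed by word characters, not only the last one. If an address has a dot before the final suffix (for example in the name part), those segments are replaced too. The sketch below uses two made-up addresses to show the difference, and anchors the pattern with $ so that only the final suffix changes. The anchored pattern is a suggested variation, not the one used in the example above.

```python
import pandas as pd

# Hypothetical addresses, chosen because they contain more than one dot
emails = pd.Series(["first.last@example.com", "user@mail.example.org"])

# Pattern from the example above: every ".word" segment is replaced
every_dot = emails.replace(to_replace=r"\.\w+", value=".edu", regex=True)

# Anchored variant (an assumption): $ restricts the match to the end of the string
suffix_only = emails.replace(to_replace=r"\.\w+$", value=".edu", regex=True)

print(every_dot.tolist())
print(suffix_only.tolist())
```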
Situation 3: Replace all the vowels in tweets with $
In this case, the vowels will be replaced with $. For example, the word Miss would become M$ss.
The regular expression for this case will be: [aeiouAEIOU]
Tweet before replacement
'Missing college! @colllege'
Tweet after replacement
df.tweet.replace(to_replace="[aeiouAEIOU]", regex=True, value='$')
'M$ss$ng c$ll$g$! @c$lll$g$'
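As a possible alternative to listing both cases in the character class, Python's inline flag (?i) can make the pattern case-insensitive. The sketch below assumes pandas compiles the pattern with Python's re module, so that inline flags are honoured. The second sample string is made up to exercise uppercase vowels.

```python
import pandas as pd

# The first tweet is from the dataset; "EXCITED" is a made-up uppercase sample
tweets = pd.Series(["Missing college! @colllege", "EXCITED"])

# Character class listing both lowercase and uppercase vowels, as in the example above
explicit = tweets.replace(to_replace=r"[aeiouAEIOU]", value="$", regex=True)

# Same idea with the inline case-insensitive flag (?i)
inline_flag = tweets.replace(to_replace=r"(?i)[aeiou]", value="$", regex=True)

print(explicit.tolist())
print(inline_flag.tolist())
```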
- Regular expressions come in handy for replacing complex string patterns that are usually difficult to replace via other functions.
- For instance, you can replace all the cuss words in a text with special characters using regex replacement.
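The word-masking idea mentioned above could be sketched as follows. The word list, the sample sentences, and the choice of **** as the mask are all placeholders for illustration. The pattern joins the words with |, wraps them in \b word boundaries so that only whole words match, and uses (?i) to ignore case.

```python
import re

import pandas as pd

# Hypothetical list of words to mask; substitute your own
blocked_words = ["darn", "heck"]

# Build one alternation pattern; re.escape guards against special characters in the words
pattern = r"(?i)\b(?:" + "|".join(re.escape(word) for word in blocked_words) + r")\b"

# Made-up sample sentences
text = pd.Series(["Well, darn it!", "What the HECK happened?", "All good here."])

masked = text.replace(to_replace=pattern, value="****", regex=True)
print(masked.tolist())
```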
Test your knowledge
Q1: To enable regular expression search in the replace function, what parameter should be enabled?
Answer: The regex parameter should be set to True.
Q2: The value parameter in the replace function is used for:
a) defining which values should be replaced in the string.
b) defining the replacement value.
c) defining the regex pattern.
d) None of these
Answer: Option (b)
Q3: Consider the dataframe below:
import pandas as pd

df = pd.DataFrame(
    data=[
        ['@mlplus', 'Our new course on ML price: 3222'],
        ['@kaustubhgupta', "Gmail down for 30 minutes. What's the matter?"],
        ['@rajveer', 'Excited to lauch our new product on 5th Jan!'],
        ['@joe', 'Will coronavirus end in 2021? #thoughts'],
        ['@abhishek', 'I want to become web developer in 4 months. Any tips? @webdeveloper @randomxyz'],
        ['@ayushi', 'Missing college! @colllege'],
    ],
    columns=['username', 'tweet']
)
df
Write the code to replace the numbers in tweets with the text 00number00 using the replace function and regular expressions.
Answer: Use the regular expression \d+, which matches one or more consecutive digits:
df.tweet.replace(to_replace=r"\d+", value='00number00', regex=True)
0    Our new course on ML price: 00number00
1    Gmail down for 00number00 minutes. What's the ...
2    Excited to lauch our new product on 00number00...
3    Will coronavirus end in 00number00?
4    I want to become web developer in 00number00 m...
5    Missing college! @colllege
Name: tweet, dtype: object
The article was contributed by Kaustubh G and Shri Varsheni.