Python Regular Expressions Tutorial and Examples: A Simplified Guide

Regular expressions, also called regex, are a syntax (or rather a small language) for searching, extracting and manipulating specific string patterns in a larger text. They are widely used in projects that involve text validation, NLP and text mining.

Regular Expressions in Python: A Simplified Tutorial. Photo by Sarah Crutchfield.

Contents

  1. Introduction to regular expressions
  2. What is a regex pattern and how to compile one?
  3. How to split a string separated by a regex?
  4. Finding pattern matches using findall, search and match
    4.1. What does re.findall() do?
    4.2. re.search() vs re.match()
  5. How to substitute one text with another using regex?
  6. Regex groups
  7. What is greedy matching in regex?
  8. Most common regular expression syntax and patterns
  9. Regular Expressions Examples
    9.1. Any character except for a new line
    9.2. A period
    9.3. Any digit
    9.4. Anything but a digit
    9.5. Any character, including digits
    9.6. Anything but a character
    9.7. Collection of characters
    9.8. Match something upto ‘n’ times
    9.9. Match 1 or more occurrences
    9.10. Match any number of occurrences (0 or more times)
    9.11. Match exactly zero or one occurrence
    9.12. Match word boundaries
  10. Practice Exercises
  11. Conclusion

1. Introduction to regular expressions

Regular expressions, also called regex, are implemented in pretty much every programming language. In Python, they are implemented in the standard module re.

They are widely used in natural language processing, in web applications that validate string input (like email addresses) and in most data science projects that involve text mining. This post is structured in 2 parts.

Before getting to the regular expressions syntax, it’s better for you to first understand how the re module works.

So, you will first be introduced to the 5 main features of the `re` module and then see how to create commonly used regular expressions in Python.

You will see how to construct pretty much any string pattern you will likely need when working on text mining related projects.

2. What is a regex pattern and how to compile one?

A regex pattern is a special language used to represent generic text, numbers or symbols so that it can be used to extract text that conforms to that pattern.

A basic example is '\s+'. Here the '\s' matches any whitespace character.

Adding a '+' at the end makes the pattern match one or more whitespace characters.

So, this pattern will also match tab '\t' and newline characters. A larger list of regex patterns comes at the end of this post. But before getting to that, let’s see how to compile and play with regular expressions.

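import re
regex = re.compile(r'\s+')

The above code imports the 're' module and compiles a regular expression pattern that matches one or more whitespace characters. The pattern is written as a raw string (r'...') so that Python passes the backslash to the regex engine unchanged.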
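To see why the r prefix matters, here is a small sketch comparing a plain string with a raw string. The sample strings and variable names are made up for illustration.

```python
import re

# Without the r prefix, Python turns '\b' into a backspace character
# before the regex engine sees it, so the pattern silently fails to match.
plain = re.findall('\btoy', 'toy cat')

# With the r prefix, the regex engine receives \b and reads it as a word boundary.
raw = re.findall(r'\btoy', 'toy cat')

print(plain)  # []
print(raw)    # ['toy']

# \s+ treats spaces, tabs and newlines alike
whitespace = re.compile(r'\s+')
print(whitespace.findall('a b\tc\n\nd'))  # [' ', '\t', '\n\n']
```

Sequences such as '\s' and '\d' happen to survive in a plain string, but recent Python versions warn about them, so writing every pattern as a raw string is the safer habit.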

3. How to split a string separated by a regex?

Let’s consider the following piece of text.

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

I have three course items in the following format: “[Course Number] [Course Code] [Course Name]”.

The spacing between the words is not uniform.

I want to split these three course items into individual units of numbers and words.

How do you do that?

This can be split in two ways:

1. By using the re.split method.
2. By calling the split method of the regex object.

# split the text around 1 or more whitespace characters
re.split(r'\s+', text)
# or
regex.split(text)
#> ['101', 'COM', 'Computers', '205', 'MAT', 'Mathematics', '189', 'ENG', 'English']

So both these methods work. But which one to use in practice?

If you intend to use a particular pattern multiple times, then you are better off compiling a regular expression rather than using re.split over and over again.
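As a rough illustration of that advice, the sketch below compiles the pattern once and reuses it across several lines. The names `splitter`, `lines` and `rows`, and the 'Computer Science' sample, are hypothetical.

```python
import re

# Compile once, reuse for every line
splitter = re.compile(r'\s+')

lines = ['101 COM    Computers',
         '205 MAT   Mathematics',
         '189 ENG   English']

rows = [splitter.split(line) for line in lines]
print(rows)
# [['101', 'COM', 'Computers'], ['205', 'MAT', 'Mathematics'], ['189', 'ENG', 'English']]

# maxsplit limits the number of splits, which keeps a multi-word name intact
limited = splitter.split('101 COM    Computer Science', maxsplit=2)
print(limited)  # ['101', 'COM', 'Computer Science']
```

The re module also caches recently used patterns internally, so the gain from compiling is usually modest. The compiled object mainly keeps the code tidy when the same pattern appears in many places.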

4. Finding pattern matches using findall, search and match

Let’s suppose you want to extract all the course numbers, that is, the numbers 101, 205 and 189 alone from the above text. How to do that?

4.1 What does re.findall() do?

# find all numbers within the text
print(text)
regex_num = re.compile(r'\d+')
regex_num.findall(text)
#> 101 COM    Computers
#> 205 MAT   Mathematics
#> 189 ENG   English
#> ['101', '205', '189']

In the above code, the special sequence '\d' is a regular expression that matches any digit.

I will be covering more such patterns later in this tutorial.

Adding a '+' symbol to it requires at least 1 digit to be present for a match.

Similar to '+', there is a '*' symbol, which matches 0 or more digits.

It practically makes the presence of a digit optional. More on this later. Finally, the findall method extracts all occurrences of 1 or more digits from the text and returns them in a list.
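The difference between '+' and '*' is easiest to see side by side. This is a minimal sketch on a made-up string; because '*' allows zero digits, findall also reports empty matches at positions where no digit is present.

```python
import re

sample = 'ab12c'

# '+' needs at least one digit, so only the run of digits is returned
plus_matches = re.findall(r'\d+', sample)
print(plus_matches)  # ['12']

# '*' accepts zero digits, so empty strings show up next to the real match
star_matches = re.findall(r'\d*', sample)
print(star_matches)

# Dropping the empty strings leaves the same result as the '+' version
non_empty = [m for m in star_matches if m]
print(non_empty)  # ['12']
```

In practice '*' is rarely useful on its own with findall. It is more helpful inside a larger pattern where the digits are an optional part.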

4.2 re.search() vs re.match()

As the name suggests, regex.search() searches for the pattern in a given text.

But unlike findall, which returns the matched portions of the text as a list, regex.search() returns a match object that contains the starting and ending positions of the first occurrence of the pattern. Likewise, regex.match() also returns a match object.

The difference is that regex.match() requires the pattern to be present at the very beginning of the text. Both return None when no match is found.

# define the text
text2 = """COM Computers 205 MAT Mathematics 189"""
# compile the regex and search the pattern
regex_num = re.compile(r'\d+')
s = regex_num.search(text2)
print('Starting Position: ', s.start())
print('Ending Position: ', s.end())
print(text2[s.start():s.end()])

#> Starting Position:  14
#> Ending Position:  17
#> 205

Alternatively, you can get the same output using the group() method of the match object.

print(s.group())
#> 205

regex.match() returns None here because text2 does not start with a digit.

m = regex_num.match(text2)
print(m)

#> None
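Since search() and match() both return None when nothing is found, it helps to check the result before calling methods on it. The sketch below compares the two on a few made-up strings and also shows fullmatch(), which the re module provides for matching the entire string.

```python
import re

regex_num = re.compile(r'\d+')

results = {}
for candidate in ['205 MAT', 'MAT 205', '205']:
    found_anywhere = regex_num.search(candidate) is not None   # pattern anywhere in the text
    found_at_start = regex_num.match(candidate) is not None    # pattern at position 0 only
    found_whole = regex_num.fullmatch(candidate) is not None   # pattern spans the whole text
    results[candidate] = (found_anywhere, found_at_start, found_whole)
    print(candidate, results[candidate])

# Guard against None before calling methods on the match object
m = regex_num.match('MAT 205')
first_number = m.group() if m else None
print(first_number)  # None
```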

 

5. How to substitute one text with another using regex?

To replace text, use regex.sub().

Let’s consider the following modified version of the courses text. Here I have added an extra tab after each course code.

# define the text
text = """101 COM \t Computers
205 MAT \t Mathematics
189 ENG \t English"""
print(text)

#> 101   COM      Computers
#> 205   MAT      Mathematics
#> 189   ENG      English

From the above text, I want to even out all the extra spaces and put all the words on one single line. To do this, you just have to use regex.sub to replace the '\s+' pattern with a single space ' '.

# replace one or more whitespace characters with a single space
regex = re.compile(r'\s+')
print(regex.sub(' ', text))
# or
print(re.sub(r'\s+', ' ', text))
#> 101 COM Computers 205 MAT Mathematics 189 ENG English

Suppose you only want to get rid of the extra spaces but want to keep the course entries in the new line itself.

To achieve that you should use a regex that effectively excludes new line characters but includes all other whitespaces.

This can be done using a negative lookahead (?!\n). It checks that the next character is not a newline before the match starts, so a run of whitespace can never begin at a line break.

# get rid of all extra spaces except newline
regex = re.compile(r'((?!\n)\s+)')
print(regex.sub(' ', text))

#> 101 COM Computers
#> 205 MAT Mathematics
#> 189 ENG English
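One caveat worth knowing: in (?!\n)\s+ the lookahead is checked only where a match begins, so a run that starts with a space can still swallow a newline that follows it. A possible alternative, sketched below, is the character class [^\S\n], which reads as "whitespace, but not a newline" and applies to every character of the run. The sample text with trailing spaces on the first line is made up for illustration.

```python
import re

# Note the two trailing spaces before the first newline
text = "101 COM \t Computers  \n205 MAT \t Mathematics\n189 ENG \t English"

lookahead_once = re.compile(r'(?!\n)\s+')   # guards only the first character of a match
not_newline_ws = re.compile(r'[^\S\n]+')    # excludes newlines from the whole run

merged = lookahead_once.sub(' ', text)
kept = not_newline_ws.sub(' ', text)

print(repr(merged))  # the first line break is lost
print(repr(kept))    # both line breaks survive
```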

 

6. Regex groups

Regular expression groups are a very useful feature that lets you extract the desired parts of a match as individual items. Suppose I want to extract the course number, code and name as separate items. Without groups, I will have to write something like this.

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""  


# 1. extract all course numbers
re.findall('[0-9]+', text)
# 2. extract all course codes
re.findall('[A-Z]{3}', text)
# 3. extract all course names
re.findall('[A-Za-z]{4,}', text)

#> ['101', '205', '189']
#> ['COM', 'MAT', 'ENG']
#> ['Computers', 'Mathematics', 'English']

Well, let’s see what just happened.

I used 3 separate regular expressions, one each for matching the course number, code and name.

For the course number, the pattern [0-9]+ matches any digit from 0 to 9.

Adding a + symbol at the end makes it look for at least 1 occurrence of the digits 0-9. If you know the course number will certainly have exactly 3 digits, the pattern could have been [0-9]{3} instead.

For the course code, you can guess that '[A-Z]{3}' will match exactly 3 consecutive capital letters A-Z.

For the course name, '[A-Za-z]{4,}' will look for upper and lower case letters a-z, assuming all course names have at least 4 characters.

Can you guess what would be the pattern if the maximum limit of characters in course name is say, 20?

Now I had to write 3 separate lines to get the individual items.

But there is a better way: Regex Groups.

Since all the entries have the same pattern, you can construct a unified pattern for the entire course entry and put the portions you want to extract inside a pair of parentheses ().

# define the course text pattern groups and extract
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
re.findall(course_pattern, text)
#> [('101', 'COM', 'Computers'), ('205', 'MAT', 'Mathematics'), ('189', 'ENG', 'English')]

Notice the patterns for the course num: [0-9]+, code: [A-Z]{3} and name: [A-Za-z]{4,} are all placed inside parentheses () in order to form the groups.
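Groups can also be given names with the (?P<name>...) syntax, which makes the extracted pieces easier to read than positional tuples. The following is a sketch only; the group names number, code and name are my own choice, not part of the original pattern.

```python
import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# Same pattern as above, with each group given a (hypothetical) name
course_pattern = re.compile(
    r'(?P<number>[0-9]+)\s*(?P<code>[A-Z]{3})\s*(?P<name>[A-Za-z]{4,})'
)

# finditer yields match objects, so each entry can be turned into a dict
courses = [m.groupdict() for m in course_pattern.finditer(text)]
print(courses[0])  # {'number': '101', 'code': 'COM', 'name': 'Computers'}

# Groups remain reachable by position as well as by name
first = course_pattern.search(text)
print(first.group(1), first.group('code'), first.groups())
```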

7. What is greedy matching in regex?

The default behavior of regular expressions is to be greedy. That means a quantifier tries to match as much text as possible while still allowing the overall pattern to match, even when a smaller part would have been sufficient.

Let’s see an example of a piece of HTML, where I want to retrieve the HTML tag.

text = "< body>Regex Greedy Matching Example < /body>"
re.findall('<.*>', text)
#> ['< body>Regex Greedy Matching Example < /body>']

Instead of matching till the first occurrence of ‘>’, which I was hoping would happen at the end of the first body tag itself, it extracted the whole string.

This is the default greedy or ‘take it all’ behavior of regex. Lazy matching, on the other hand, ‘takes as little as possible’. This can be done by adding a `?` right after the quantifier, so `.*` becomes `.*?`.

re.findall('<.*?>', text)
#> ['< body>', '< /body>']

If you want only the first match to be retrieved, use the search method instead.

re.search('<.*?>', text).group()
#> '< body>'
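The lazy `?` works after any quantifier, not just `*`. The sketch below, on a made-up snippet, compares greedy and lazy matching and also shows a negated character class, a common alternative that stops at the first ‘>’ without relying on laziness.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r'<.*>', html)      # runs to the last '>' in the string
lazy = re.findall(r'<.*?>', html)       # stops at the first '>' it can
negated = re.findall(r'<[^>]*>', html)  # "anything but '>'" between the brackets

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']

# Lazy '+' takes one digit at a time instead of the whole run
lazy_digits = re.findall(r'\d+?', '2015')
print(lazy_digits)  # ['2', '0', '1', '5']
```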

 

8. Most common regular expression syntax and patterns

Now that you understand how to use the re module, let’s see some commonly used wildcard patterns.

Basic Syntax
.             One character except new line
\.            A period. \ escapes a special character.
\d            One digit
\D            One non-digit
\w            One word character including digits
\W            One non-word character
\s            One whitespace
\S            One non-whitespace
\b            Word boundary
\n            Newline
\t            Tab

Modifiers
$             End of string
^             Start of string
ab|cd         Matches ab or cd.
[ab-d]        One character of: a, b, c, d
[^ab-d]       One character except: a, b, c, d
()            Items within parenthesis are retrieved
(a(bc))       Items within the sub-parenthesis are retrieved

Repetitions
[ab]{2}       Exactly 2 continuous occurrences of a or b
[ab]{2,5}     2 to 5 continuous occurrences of a or b
[ab]{2,}      2 or more continuous occurrences of a or b
+             One or more
*             Zero or more
?             0 or 1
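A few entries from the table, tried out on small made-up inputs. One detail the table does not spell out: by default ^ and $ anchor to the whole string, and passing re.MULTILINE makes them match at every line instead.

```python
import re

text = """101 COM Computers
205 MAT Mathematics"""

# ^ anchors to the start of the string unless re.MULTILINE is set
start_default = re.findall(r'^\d+', text)
start_per_line = re.findall(r'^\d+', text, flags=re.MULTILINE)
print(start_default)   # ['101']
print(start_per_line)  # ['101', '205']

# $ anchors to the end of the string
end_default = re.findall(r'\w+$', text)
print(end_default)  # ['Mathematics']

# Alternation, a character class with a range, and nested groups
alternation = re.findall(r'ab|cd', 'abcd')
char_class = re.findall(r'[ab-d]', 'a-e d')
nested = re.findall(r'(a(bc))', 'abc')
print(alternation)  # ['ab', 'cd']
print(char_class)   # ['a', 'd']
print(nested)       # [('abc', 'bc')]
```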

9. Regular Expressions Examples

9.1. Any character except for a new line

text = 'machinelearningplus.com'
print(re.findall('.', text))  # .   Any character except for a new line
print(re.findall('...', text))
#> ['m', 'a', 'c', 'h', 'i', 'n', 'e', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', 'p', 'l', 'u', 's', '.', 'c', 'o', 'm']
#> ['mac', 'hin', 'ele', 'arn', 'ing', 'plu', 's.c']

9.2. A period

text = 'machinelearningplus.com'
print(re.findall(r'\.', text))  # matches a period
print(re.findall(r'[^\.]', text)) # matches anything but a period
#> ['.']
#> ['m', 'a', 'c', 'h', 'i', 'n', 'e', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', 'p', 'l', 'u', 's', 'c', 'o', 'm']

9.3. Any digit

text = '01, Jan 2015'
print(re.findall(r'\d+', text))  # \d  Any digit. The + requires at least 1 digit.
#> ['01', '2015']

9.4. Anything but a digit

text = '01, Jan 2015'
print(re.findall(r'\D+', text))  # \D  Anything but a digit
#> [', Jan ']

9.5. Any character, including digits

text = '01, Jan 2015'
print(re.findall(r'\w+', text))  # \w  Any word character (letter, digit or underscore)
#> ['01', 'Jan', '2015']

9.6. Anything but a character

text = '01, Jan 2015'
print(re.findall(r'\W+', text))  # \W  Anything but a word character
#> [', ', ' ']

9.7. Collection of characters

text = '01, Jan 2015'
print(re.findall('[a-zA-Z]+', text))  # [] Matches any character inside 
#> ['Jan']

9.8. Match something upto ‘n’ times

text = '01, Jan 2015'
print(re.findall(r'\d{4}', text))  # {n} Matches exactly n repetitions.
print(re.findall(r'\d{2,4}', text))  # {m,n} Matches m to n repetitions.
#> ['2015']
#> ['01', '2015']

9.9. Match 1 or more occurrences

print(re.findall(r'Co+l', 'So Cooool'))  # Match for 1 or more occurrences 
#> ['Cooool']

9.10. Match any number of occurrences (0 or more times)

print(re.findall(r'Pi*lani', 'Pilani'))
#> ['Pilani']

9.11. Match exactly zero or one occurrence

print(re.findall(r'colou?r', 'color'))
#> ['color']

 

9.12. Match word boundaries

Word boundaries \b are commonly used to detect and match the beginning or end of a word. That is, one side is a word character and the other side is a non-word character (such as whitespace or punctuation) or the start or end of the string.

For example, the regex \btoy will match the ‘toy’ in ‘toy cat’ and not in ‘tolstoy’. In order to match the ‘toy’ in ‘tolstoy’, you should use toy\b.

Can you come up with a regex that will match only the first ‘toy’ in ‘play toy broke toys’? (hint: \b on both sides)

Likewise, \B will match any non-boundary. For example, \Btoy\B will match ‘toy’ surrounded by word characters on both sides, as in ‘antoynet’.

re.findall(r'\btoy\b', 'play toy broke toys')  # match toy with boundary on both sides
#> ['toy']
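To compare the four boundary combinations at a glance, the sketch below runs each one against the same short word list. The list and the labels are made up for illustration.

```python
import re

words = ['toy', 'toys', 'tolstoy', 'antoynet']

patterns = {
    'starts with toy': r'\btoy',
    'ends with toy': r'toy\b',
    'exactly toy': r'\btoy\b',
    'toy inside a word': r'\Btoy\B',
}

hits = {}
for label, pattern in patterns.items():
    # Keep the words in which the pattern is found somewhere
    hits[label] = [w for w in words if re.search(pattern, w)]
    print(label, hits[label])

# starts with toy    -> ['toy', 'toys']
# ends with toy      -> ['toy', 'tolstoy']
# exactly toy        -> ['toy']
# toy inside a word  -> ['antoynet']
```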

 

10. Practice Exercises

Let’s get some practice.

It’s time to open up your python console.

1. Extract the user id, domain name and suffix from the following email addresses.

emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

desired_output = [('zuck26', 'facebook', 'com'),
 ('page33', 'google', 'com'),
 ('jeff42', 'amazon', 'com')]

# Solution
pattern = r'(\w+)@([A-Z0-9]+)\.([A-Z]{2,4})'
re.findall(pattern, emails, flags=re.IGNORECASE)
#> [('zuck26', 'facebook', 'com'),
#>  ('page33', 'google', 'com'),
#>  ('jeff42', 'amazon', 'com')]

Use groups with (). There are more sophisticated patterns for matching the email domain and suffix. This is just one version of the answer.

2. Retrieve all the words starting with ‘b’ or ‘B’ from the following text.

text = """Betty bought a bit of butter, But the butter was so bitter, So she bought some better butter, To make the bitter butter better."""

# Solution
import re
re.findall(r'\bB\w+', text, flags=re.IGNORECASE)
#> ['Betty', 'bought', 'bit', 'butter', 'But', 'butter', 'bitter', 'bought', 'better', 'butter', 'bitter', 'butter', 'better']

‘\b’ requires a word boundary to the left of ‘B’, effectively requiring the word to start with ‘B’. Setting the ‘flags’ argument to ‘re.IGNORECASE’ makes the pattern case insensitive.

3. Split the following irregular sentence into words

sentence = """A, very   very; irregular_sentence"""
desired_output = "A very very irregular sentence"

# Solution
import re
" ".join(re.split(r'[;,\s_]+', sentence))
#> 'A very very irregular sentence'

Add more delimiters into the pattern as needed.

4. Clean up the following tweet so that it contains only the user’s message. That is, remove all URLs, hashtags, mentions, punctuation, RTs and CCs.

tweet = '''Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today http://t.co/lbwej0pxOd cc: @garybernhardt #rstats'''

desired_output = 'Good advice What I would do differently if I was learning to code today'

# Solution
import re

def clean_tweet(tweet):
    tweet = re.sub(r'http\S+\s*', '', tweet)  # remove URLs
    tweet = re.sub(r'RT|cc', '', tweet)  # remove RT and cc
    tweet = re.sub(r'#\S+', '', tweet)  # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)  # remove mentions
    tweet = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), '', tweet)  # remove punctuation
    tweet = re.sub(r'\s+', ' ', tweet)  # collapse extra whitespace
    return tweet.strip()  # drop leading and trailing whitespace

print(clean_tweet(tweet))
#> Good advice What I would do differently if I was learning to code today
5. Extract all the text portions between the tags from the following HTML page: https://raw.githubusercontent.com/selva86/datasets/master/sample.html

Code to retrieve the HTML page:

import requests
r = requests.get("https://raw.githubusercontent.com/selva86/datasets/master/sample.html")
r.text  # html text is contained here

desired_output = ['Your Title Here', 'Link Name', 'This is a Header', 'This is a Medium Header', 'This is a new paragraph! ', 'This is a another paragraph!', 'This is a new sentence without a paragraph break, in bold italics.']

# Solution
import re
re.findall('<.*?>(.*)</.*?>', r.text)
#> ['Your Title Here', 'Link Name', 'This is a Header', 'This is a Medium Header', 'This is a new paragraph! ', 'This is a another paragraph!', 'This is a new sentence without a paragraph break, in bold italics.']
 

11. Conclusion

I hope you enjoyed reading this.

The purpose of this post was to introduce you to regular expressions in a simplified way that you can remember, and to give you something you can use as a future reference.
