Python RegEx - Regular Expression Tutorial
A Regular Expression (RegEx) is a sequence of characters that defines a search pattern, used to find a string or a set of strings.
. - Any Character Except New Line
\d - Digit (0-9)
\D - Not a Digit (0-9)
\w - Word Character (a-z, A-Z, 0-9, _)
\W - Not a Word Character
\s - Whitespace (space, tab, newline)
\S - Not Whitespace (space, tab, newline)
\b - Word Boundary
\B - Not a Word Boundary
^ - Beginning of a String
$ - End of a String
[] - Matches Characters in brackets
[^ ] - Matches Characters NOT in brackets
| - Either Or
( ) - Group
Quantifiers:
* - 0 or More
+ - 1 or More
? - 0 or One
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
#### Sample Regexs ####
[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
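As a quick illustration of how a sample pattern like the one above might be used, here is a minimal sketch. The variable names and the sample sentence are assumptions made for this example, and the pattern is a simple approximation, not a complete e-mail validator.

```python
import re

# Sample e-mail pattern from the list above (an approximation, not a full validator).
EMAIL = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# Hypothetical sample sentence, used only for illustration.
sample = "Contact support@fresher.com or sales@example.org for help"

# findall returns every non-overlapping match as a plain string.
emails = EMAIL.findall(sample)
print(emails)  # ['support@fresher.com', 'sales@example.org']
```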
To start using regular expressions, we have to import the re module-
import re
Example 1-
To check whether the particular string starts with "Welcome" and ends with ".com" or not-
txt = "Welcome to the fresherbell.com"
x = re.search("^Welcome.*com$", txt)
if x:
    print("matched")
else:
    print("Not matched")
Output-
matched
Example 2-
To extract the particular string ("Fresherbell") from the given text variable-
text= '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ\n
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
support@fresher.com
321-555-4321
Mr. Schafer
Mr Samsung
Ms Philip
Mr. Fresherbell
Mr. T
Mr_hello
'''
pattern = re.compile('Fresherbell') # what to extract
matches = pattern.finditer(text) # from where to extract
list(matches)
Output-
[<re.Match object; span=(211, 222), match='Fresherbell'>]
text[211:222]
Output-
'Fresherbell'
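Instead of slicing the text by hand, the match object itself can report where it matched and what it matched. The sketch below assumes a shortened, hypothetical sample text, so the span differs from the output above.

```python
import re

# Shortened, hypothetical sample text (not the full text variable used above).
text = "Mr. Schafer\nMr. Fresherbell\n"

for match in re.finditer('Fresherbell', text):
    start, end = match.span()   # start and end index of the match
    found = match.group()       # the matched text itself
    print(start, end, found)    # 16 27 Fresherbell
```

Slicing with text[start:end] gives the same string as match.group().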
\d
Suppose we want to extract all the digits (0-9) from the variable text then-
Example-
pattern = re.compile(r'\d') # what to extract
matches = pattern.finditer(text) # from where to extract
list(matches)
Output-
[<re.Match object; span=(57, 58), match='1'>,
<re.Match object; span=(58, 59), match='2'>,
<re.Match object; span=(59, 60), match='3'>,
<re.Match object; span=(60, 61), match='4'>,
<re.Match object; span=(61, 62), match='5'>,
<re.Match object; span=(62, 63), match='6'>,
<re.Match object; span=(63, 64), match='7'>,
<re.Match object; span=(64, 65), match='8'>,
<re.Match object; span=(65, 66), match='9'>,
<re.Match object; span=(66, 67), match='0'>,
<re.Match object; span=(161, 162), match='3'>,
<re.Match object; span=(162, 163), match='2'>,
<re.Match object; span=(163, 164), match='1'>,
<re.Match object; span=(165, 166), match='5'>,
<re.Match object; span=(166, 167), match='5'>,
<re.Match object; span=(167, 168), match='5'>,
<re.Match object; span=(169, 170), match='4'>,
<re.Match object; span=(170, 171), match='3'>,
<re.Match object; span=(171, 172), match='2'>,
<re.Match object; span=(172, 173), match='1'>]
\D
Suppose we want to extract all characters except digits (0-9) from the variable text then-
Example-
pattern = re.compile(r'\D') # what to extract
matches = pattern.finditer(text) # from where to extract
list(matches)
Output-
[<re.Match object; span=(0, 1), match='\n'>,
<re.Match object; span=(1, 2), match='a'>,
<re.Match object; span=(2, 3), match='b'>,
<re.Match object; span=(3, 4), match='c'>,
<re.Match object; span=(4, 5), match='d'>,
<re.Match object; span=(5, 6), match='e'>,
<re.Match object; span=(6, 7), match='f'>,
<re.Match object; span=(7, 8), match='g'>,.......]
\w
Suppose we want to extract all the word characters (a-z, A-Z, 0-9, _) from the variable text then-
Example-
pattern = re.compile(r'\w') # what to extract
matches = pattern.finditer(text) # from where to extract
list(matches)
Output-
[.....
<re.Match object; span=(162, 163), match='1'>,
.
.
.
<re.Match object; span=(223, 224), match='r'>,
<re.Match object; span=(226, 227), match='T'>,
<re.Match object; span=(228, 229), match='M'>,
<re.Match object; span=(229, 230), match='r'>,
<re.Match object; span=(230, 231), match='_'>,
<re.Match object; span=(231, 232), match='h'>,
.......]
\W
Suppose we want to extract every character that is not a word character (a-z, A-Z, 0-9, _) from the variable text then-
Example-
pattern = re.compile(r'\W') # what to extract
matches = pattern.finditer(text) # from where to extract
list(matches)
Output-
[.....
<re.Match object; span=(111, 112), match='\n'>,
<re.Match object; span=(112, 113), match='.'>,
<re.Match object; span=(113, 114), match=' '>,
<re.Match object; span=(114, 115), match='^'>,
<re.Match object; span=(115, 116), match=' '>,
<re.Match object; span=(116, 117), match='$'>,
.......]
\s
Suppose we want to extract whitespace (space, tab, newline) from the variable text then-
Example-
pattern = re.compile(r'\s') # what to extract
matches = pattern.finditer(text) # from where to extract
list(matches)
Output-
[.....
<re.Match object; span=(98, 99), match=' '>,
<re.Match object; span=(101, 102), match=' '>,
<re.Match object; span=(111, 112), match='\n'>,
<re.Match object; span=(113, 114), match=' '>,
<re.Match object; span=(115, 116), match=' '>,
<re.Match object; span=(117, 118), match=' '>,
.......]
\S
Suppose we want to extract all characters except whitespace (space, tab, newline) from the variable text then-
Example-
pattern = re.compile(r'\S') # what to extract
matches = pattern.finditer(text) # from where to extract
list(matches)
Output-
[.....
<re.Match object; span=(155, 156), match='.'>,
<re.Match object; span=(156, 157), match='c'>,
<re.Match object; span=(157, 158), match='o'>,
<re.Match object; span=(158, 159), match='m'>,
<re.Match object; span=(160, 161), match='3'>,
.......]
Raw String
Patterns are usually written as raw strings, for example r'.' to extract every character or r'\d' to extract every digit.
The r prefix creates a raw string, i.e. Python does not treat \ as an escape character inside the literal. This matters because the regex engine also uses \ as its escape character, so the backslash has to reach it unchanged.
In a normal string-
\n is a new line
\t is a tab
In a raw string-
\n stays as the two characters \ and n, not a new line
\t stays as the two characters \ and t, not a tab
Example 1-
s = 'Hello\nFresherbell\t.com'
print(s)
Output-
Hello
Fresherbell .com
Example 2-
The same string with the r prefix keeps every backslash as a literal character-
s = r'Hello\nFresherbell\t.com'
print(s)
Output-
Hello\nFresherbell\t.com
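To see why raw strings matter for patterns, consider \b. In a normal Python string '\b' is a backspace character, while in a raw string r'\b' is passed to the regex engine as a word boundary. The following is a minimal sketch; the variable names are chosen only for this example.

```python
import re

# A normal string turns \n into one character; a raw string keeps two.
print(len('\n'), len(r'\n'))  # 1 2

data = 'cat catherine'

# '\b' is a backspace character here, so the pattern looks for backspace + "cat".
plain = re.findall('\bcat', data)

# r'\b' reaches the regex engine unchanged and means "word boundary".
raw = re.findall(r'\bcat', data)

print(plain)  # []
print(raw)    # ['cat', 'cat']
```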
\b - Word Boundary
Suppose we want to extract only the standalone word "cat" from the below data-
data = 'cat catherine catholic wildcat copycat uncatchable'
Let's see an example-
pattern = re.compile('cat')
matches = pattern.finditer(data)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(4, 7), match='cat'>
<re.Match object; span=(14, 17), match='cat'>
<re.Match object; span=(27, 30), match='cat'>
<re.Match object; span=(35, 38), match='cat'>
<re.Match object; span=(41, 44), match='cat'>
In the above program, we are also matching "cat" inside catherine, catholic, wildcat, etc. But we only want the standalone word "cat". So, let's see another example-
pattern = re.compile('cat ')
matches = pattern.finditer(data)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 4), match='cat '>
<re.Match object; span=(27, 31), match='cat '>
<re.Match object; span=(35, 39), match='cat '>
In the above program, every match now ends at a space, so the trailing space is included in each match (and the "cat" at the end of wildcat and copycat is still matched). To solve this issue, we can use a word boundary-
word\b = Right-hand side of the word should not be a word char
\bword = Left-hand side of the word should not be a word char
\bword\b = Both hand sides of the word should not be a word char.
Example-
pattern = re.compile(r'cat\b')
matches = pattern.finditer(data)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(27, 30), match='cat'>
<re.Match object; span=(35, 38), match='cat'>
Still, the issue is not resolved; we are still getting "cat" from the words wildcat and copycat.
Other Example-
pattern = re.compile(r'\bcat')
matches = pattern.finditer(data)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(4, 7), match='cat'>
<re.Match object; span=(14, 17), match='cat'>
Still, the issue is not resolved; we are still getting "cat" from the words catherine and catholic.
Correct Solution-
pattern = re.compile(r'\bcat\b')
matches = pattern.finditer(data)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 3), match='cat'>
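The \bword\b idea can be wrapped in a small helper so it works for any word. This is only a sketch: find_whole_word is a hypothetical name introduced here, and re.escape is used so that a word containing special regex characters is still matched literally.

```python
import re

def find_whole_word(word, data):
    """Return the spans where `word` occurs as a standalone word (hypothetical helper)."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    return [m.span() for m in pattern.finditer(data)]

data = 'cat catherine catholic wildcat copycat uncatchable'
spans = find_whole_word('cat', data)
print(spans)  # [(0, 3)]
```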
\B - Not a Word Boundary
word\B = Right-hand side of the word should be a word char
\Bword = Left-hand side of the word should be a word char
\Bword\B = Both sides of the word should be word chars
data = 'she sells seashells at sea shore'
Example 1-
pattern = re.compile(r's\B')
matches = pattern.finditer(data)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 1), match='s'>
<re.Match object; span=(4, 5), match='s'>
<re.Match object; span=(10, 11), match='s'>
<re.Match object; span=(13, 14), match='s'>
<re.Match object; span=(23, 24), match='s'>
<re.Match object; span=(27, 28), match='s'>
Example 2-
pattern = re.compile(r'\Bs')
matches = pattern.finditer(data)
for match in matches:
    print(match)
Output-
<re.Match object; span=(8, 9), match='s'>
<re.Match object; span=(13, 14), match='s'>
<re.Match object; span=(18, 19), match='s'>
Example 3-
pattern = re.compile(r'\Bs\B')
matches = pattern.finditer(data)
for match in matches:
    print(match)
Output-
<re.Match object; span=(13, 14), match='s'>
^ - Beginning of a String
To check whether a string begins with a particular word or not-
Example-
txt = "Welcome to the fresherbell.com"
x = re.search("^Welcome", txt)
if x:
    print("Beginning with Welcome")
else:
    print("Not Beginning with Welcome")
Output-
Beginning with Welcome
$ - End of a String
To check whether a string ends with a particular word or not-
Example-
txt = "Welcome to the fresherbell.com"
x = re.search("com$", txt)
if x:
    print("Ending with com")
else:
    print("Not Ending with com")
Output-
Ending with com
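By default ^ and $ anchor to the beginning and end of the whole string. If the text has several lines and each line should be checked, the re.MULTILINE flag (not covered elsewhere in this tutorial) makes ^ and $ also match at line breaks. A minimal sketch, with a hypothetical three-line sample string:

```python
import re

# Hypothetical multi-line sample, used only for illustration.
lines = "Welcome to the fresherbell.com\nWelcome back\nvisit example.com"

# Without the flag, ^ only matches at the very start of the string.
default = re.findall(r'^Welcome', lines)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
per_line = re.findall(r'^Welcome', lines, flags=re.MULTILINE)
ends = re.findall(r'com$', lines, flags=re.MULTILINE)

print(default)   # ['Welcome']
print(per_line)  # ['Welcome', 'Welcome']
print(ends)      # ['com', 'com']
```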
----------------------------------------------------------------------------------------------
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ\s
321-555-4321
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
support@gmail.com
321-555-4321
123.555.1234
123*555*-1234
123.555.1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
Mr_hello
'''
# Q1 - write a regex to search all the 3 digit numbers
pattern = re.compile(r'\b\d\d\d\b')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(57, 60), match='321'>
<re.Match object; span=(61, 64), match='555'>
<re.Match object; span=(172, 175), match='321'>
<re.Match object; span=(176, 179), match='555'>
<re.Match object; span=(185, 188), match='123'>
<re.Match object; span=(189, 192), match='555'>
<re.Match object; span=(198, 201), match='123'>
<re.Match object; span=(202, 205), match='555'>
<re.Match object; span=(212, 215), match='123'>
<re.Match object; span=(216, 219), match='555'>
<re.Match object; span=(225, 228), match='800'>
<re.Match object; span=(229, 232), match='555'>
<re.Match object; span=(238, 241), match='900'>
<re.Match object; span=(242, 245), match='555'>
#Q2 extract a valid phone number = nnn.nnn.nnnn
pattern = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(185, 197), match='123.555.1234'>
<re.Match object; span=(212, 224), match='123.555.1234'>
[ ] - Matches any single character in brackets.
Example 1-
txt = "Python Mython ython. python Rython"
pattern = re.compile('[Pp]ython')
matches = pattern.finditer(txt)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 6), match='Python'>
<re.Match object; span=(21, 27), match='python'>
Example 2-
txt = "Python Mython ython. python Rython"
pattern = re.compile('[on]')
matches = pattern.finditer(txt)
for match in matches:
    print(match)
Output-
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(5, 6), match='n'>
<re.Match object; span=(11, 12), match='o'>
<re.Match object; span=(12, 13), match='n'>
<re.Match object; span=(17, 18), match='o'>
<re.Match object; span=(18, 19), match='n'>
<re.Match object; span=(25, 26), match='o'>
<re.Match object; span=(26, 27), match='n'>
<re.Match object; span=(32, 33), match='o'>
<re.Match object; span=(33, 34), match='n'>
#Q3 extract a valid phone number = nnn.nnn.nnnn / nnn-nnn-nnnn
Example-
pattern = re.compile(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(57, 69), match='321-555-4321'>
<re.Match object; span=(172, 184), match='321-555-4321'>
<re.Match object; span=(185, 197), match='123.555.1234'>
<re.Match object; span=(212, 224), match='123.555.1234'>
<re.Match object; span=(225, 237), match='800-555-1234'>
<re.Match object; span=(238, 250), match='900-555-1234'>
Example 3 - Range
Without using range
txt = "1th 2th 3th 4th 5th 6th ath bth cth dth eth"
pattern = re.compile('[123456abcde]th')
matches = pattern.finditer(txt)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 3), match='1th'>
<re.Match object; span=(4, 7), match='2th'>
<re.Match object; span=(8, 11), match='3th'>
<re.Match object; span=(12, 15), match='4th'>
<re.Match object; span=(16, 19), match='5th'>
<re.Match object; span=(20, 23), match='6th'>
<re.Match object; span=(24, 27), match='ath'>
<re.Match object; span=(28, 31), match='bth'>
<re.Match object; span=(32, 35), match='cth'>
<re.Match object; span=(36, 39), match='dth'>
<re.Match object; span=(40, 43), match='eth'>
With Range
txt = "1th 2th 3th 4th 5th 6th ath bth cth dth eth"
pattern = re.compile('[1-6a-e]th')
matches = pattern.finditer(txt)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 3), match='1th'>
<re.Match object; span=(4, 7), match='2th'>
<re.Match object; span=(8, 11), match='3th'>
<re.Match object; span=(12, 15), match='4th'>
<re.Match object; span=(16, 19), match='5th'>
<re.Match object; span=(20, 23), match='6th'>
<re.Match object; span=(24, 27), match='ath'>
<re.Match object; span=(28, 31), match='bth'>
<re.Match object; span=(32, 35), match='cth'>
<re.Match object; span=(36, 39), match='dth'>
<re.Match object; span=(40, 43), match='eth'>
[^ ] - Matches any single character not in brackets
Example-
txt = "Python Mython ython. python Rython"
pattern = re.compile('[^Pp]ython')
matches = pattern.finditer(txt)
for match in matches:
    print(match)
Output-
<re.Match object; span=(7, 13), match='Mython'>
<re.Match object; span=(13, 19), match=' ython'>
<re.Match object; span=(28, 34), match='Rython'>
Example 2-
txt = "Python Mython ython. python Rython"
pattern = re.compile('[^ython]')
matches = pattern.finditer(txt)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 1), match='P'>
<re.Match object; span=(6, 7), match=' '>
<re.Match object; span=(7, 8), match='M'>
<re.Match object; span=(13, 14), match=' '>
<re.Match object; span=(19, 20), match='.'>
<re.Match object; span=(20, 21), match=' '>
<re.Match object; span=(21, 22), match='p'>
<re.Match object; span=(27, 28), match=' '>
<re.Match object; span=(28, 29), match='R'>
Example 3 - Range
Using range with ^
txt = "1th 2th 3th 4th 5th 6th ath bth cth dth eth"
pattern = re.compile('[^1-5a-d]th')
matches = pattern.finditer(txt)
for match in matches:
    print(match)
Output-
<re.Match object; span=(20, 23), match='6th'>
<re.Match object; span=(40, 43), match='eth'>
{n} - Matches exactly n number of occurrences of the preceding expression.
Instead of using \d\d\d (i.e \d 3 times), we can use \d{3}
Example-
pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(57, 69), match='321-555-4321'>
<re.Match object; span=(172, 184), match='321-555-4321'>
<re.Match object; span=(185, 197), match='123.555.1234'>
<re.Match object; span=(212, 224), match='123.555.1234'>
<re.Match object; span=(225, 237), match='800-555-1234'>
<re.Match object; span=(238, 250), match='900-555-1234'>
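To confirm that the {n} form is only a shorter way of writing the repeated \d, both patterns can be run on the same input and compared. A small sketch, using a shortened, hypothetical sample string instead of the full text_to_search:

```python
import re

# Shortened, hypothetical sample (three of the phone-like strings from above).
sample = "321-555-4321 123.555.1234 123*555*-1234"

long_form = re.findall(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d', sample)
short_form = re.findall(r'\d{3}[.-]\d{3}[.-]\d{4}', sample)

print(short_form)               # ['321-555-4321', '123.555.1234']
print(long_form == short_form)  # True
```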
{n,} - Matches n or more occurrences of the preceding expression.
If we want to match a run of 2 or more digits, then we can use \d{2,}
text_to_search = '''
321-555-4321
321-555-432165
321-555-43
123.555.1234
123*555*-1234
123.555.1234
800-555-1234
900-555-1234
'''
Example-
pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{2,}')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 13), match='321-555-4321'>
<re.Match object; span=(14, 28), match='321-555-432165'>
<re.Match object; span=(29, 39), match='321-555-43'>
<re.Match object; span=(40, 52), match='123.555.1234'>
<re.Match object; span=(67, 79), match='123.555.1234'>
<re.Match object; span=(80, 92), match='800-555-1234'>
<re.Match object; span=(93, 105), match='900-555-1234'>
{n,m} - Matches at least n and at most m occurrences of the preceding expression.
If we want to match a run of 3 to 6 digits, then we can use \d{3,6}
Example-
pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{3,6}')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 13), match='321-555-4321'>
<re.Match object; span=(14, 28), match='321-555-432165'>
<re.Match object; span=(40, 52), match='123.555.1234'>
<re.Match object; span=(67, 79), match='123.555.1234'>
<re.Match object; span=(80, 92), match='800-555-1234'>
<re.Match object; span=(93, 105), match='900-555-1234'>
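One thing to keep in mind is that {n,m} only limits how much of the text is consumed. If more digits follow, the pattern still matches the first m of them. A possible way to reject such over-long numbers is to add \b after the quantifier. The sample string below is hypothetical and contains a last group of 2, 6 and 7 digits.

```python
import re

# Hypothetical sample: the last group has 2, 6 and 7 digits respectively.
sample = "321-555-43 321-555-432165 321-555-4321657"

# Without a boundary, the 7-digit number is matched up to its 6th digit.
loose = re.findall(r'\d{3}[.-]\d{3}[.-]\d{3,6}', sample)

# With \b, the match must end where the run of digits ends.
strict = re.findall(r'\d{3}[.-]\d{3}[.-]\d{3,6}\b', sample)

print(loose)   # ['321-555-432165', '321-555-432165']
print(strict)  # ['321-555-432165']
```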
Escape Character (\char)
To match a special regex character literally-
Special Regex Characters: These characters have a special meaning in regex (to be discussed below): ., +, *, ?, ^, $, (, ), [, ], {, }, |, \.
To match a character that has a special meaning in regex, prefix it with a backslash (\). E.g., \. matches "."; \+ matches "+"; and \( matches "(".
To match a literal backslash "\", use \\.
Without escape character
text_to_search = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
Mr_hello
'''
Example -
pattern = re.compile(r'Mr.')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 4), match='Mr.'>
<re.Match object; span=(13, 16), match='Mr '>
<re.Match object; span=(31, 34), match='Mrs'>
<re.Match object; span=(45, 48), match='Mr.'>
<re.Match object; span=(51, 54), match='Mr_'>
With escape character-
pattern = re.compile(r'Mr\.')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 4), match='Mr.'>
<re.Match object; span=(45, 48), match='Mr.'>
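When the text to match comes from a variable, escaping every special character by hand is error-prone. The standard library function re.escape adds the backslashes for us. The sketch below is one way to use it; the variable names and the short sample text are assumptions for this example.

```python
import re

# The metacharacters listed above; '\\' is a single backslash in a normal string.
metachars = '. ^ $ * + ? { } [ ] \\ | ( )'.split()

# After escaping, each metacharacter matches itself literally.
all_literal = all(re.fullmatch(re.escape(ch), ch) for ch in metachars)
print(all_literal)  # True

# re.escape('Mr.') produces the same pattern as the hand-written r'Mr\.'.
escaped_title = re.escape('Mr.')
matches = re.findall(escaped_title, 'Mr. Schafer\nMr Smith\nMr_hello')
print(matches)  # ['Mr.']
```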
* - Matches 0 or more occurrences of the preceding expression.
Without *
pattern = re.compile(r'Mr\. [A-Z][a-z]')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 7), match='Mr. Sc'>
With *
pattern = re.compile(r'Mr\. [A-Z][a-z]*')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(45, 50), match='Mr. T'>
+ - Matches 1 or more occurrences of the preceding expression.
Example-
pattern = re.compile(r'Mr\. [A-Z][a-z]+')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 12), match='Mr. Schafer'>
? - Matches 0 or 1 occurrence of the preceding expression.
Example-
pattern = re.compile(r'Mr\.? [A-Z][a-z]*')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(13, 21), match='Mr Smith'>
<re.Match object; span=(45, 50), match='Mr. T'>
a|b - Matches either a or b.
Example-
pattern = re.compile(r'M(r|s|rs)\. [A-Z][a-z]*')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(31, 44), match='Mrs. Robinson'>
<re.Match object; span=(45, 50), match='Mr. T'>
( ) - Groups regular expressions and remembers matched text.
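As a sketch of what "remembers matched text" means, each pair of parentheses can be read back from the match object with group(1), group(2) and so on, while group(0) is the whole match. The pattern below is an assumption written for this example: it captures the title and the name separately and makes the dot optional.

```python
import re

text_to_search = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
Mr_hello
'''

# Group 1 = title, group 2 = name; the dot after the title is optional.
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s([A-Z]\w*)')

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(text_to_search)]
for title, name in pairs:
    print(title, name)
```

Output-
Mr Schafer
Mr Smith
Ms Davis
Mrs Robinson
Mr T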