Python - Python Advanced - RegEx - Regular Expression Tutorial

A Regular Expression (RegEx) is a special sequence of characters that defines a search pattern, which is used to find a string or a set of strings.

. - Any Character Except New Line
\d - Digit (0-9)
\D - Not a Digit (0-9)
\w - Word Character (a-z, A-Z, 0-9, _)
\W - Not a Word Character
\s - Whitespace (space, tab, newline)
\S - Not Whitespace (space, tab, newline)

\b - Word Boundary
\B - Not a Word Boundary
^ - Beginning of a String
$ - End of a String

[] - Matches Characters in brackets
[^ ] - Matches Characters NOT in brackets
| - Either Or
( ) - Group

Quantifiers:
* - 0 or More
+ - 1 or More
? - 0 or One
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)

#### Sample Regex (e-mail address) ####

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
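As a rough sketch, the sample pattern above could be used to pull e-mail addresses out of a piece of text. The names EMAIL_PATTERN and sample, and the sample sentence itself, are made up for this illustration. The pattern is a simple approximation and does not cover every valid address.

```python
import re

# The sample pattern from above, written as a raw string.
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# Hypothetical sample text, used only for this illustration.
sample = "Contact support@fresher.com or sales@example.org for help"

# findall returns every non-overlapping match as a list of strings.
emails = EMAIL_PATTERN.findall(sample)
print(emails)
```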

To start using regular expressions, we have to import the re module-

import re

Example 1-
To check whether a particular string starts with "Welcome" and ends with ".com" or not-

txt = "Welcome to the fresherbell.com"
x = re.search(r"^Welcome.*\.com$", txt)  # \. matches a literal dot

if x:
    print("matched")
else:
    print("Not matched")

Output-

matched

Example 2-
To find a particular string ("Fresherbell") in the given text variable-

text= '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ\n
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
support@fresher.com
321-555-4321
Mr. Schafer
Mr Samsung
Ms Philip
Mr. Fresherbell
Mr. T
Mr_hello
'''

pattern = re.compile('Fresherbell') # what to extract

matches = pattern.finditer(text) # from where to extract

list(matches)

Output-

[<re.Match object; span=(210, 221), match='Fresherbell'>]

text[210:221]

Output-

'Fresherbell'
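Instead of slicing the text by hand, the match object can report the matched text and its position directly. The short sketch below uses a smaller, made-up text so that it can run on its own. group(), start() and end() are standard methods of match objects.

```python
import re

# A shortened, hypothetical version of the text used above.
text = "Mr. Schafer\nMr. Fresherbell\nMr. T"

pattern = re.compile('Fresherbell')  # what to extract

# Collect (matched text, start index, end index) for every match.
results = []
for match in pattern.finditer(text):
    results.append((match.group(), match.start(), match.end()))

print(results)
```

Note that finditer returns an iterator, so it can be looped over only once. Call finditer again if the matches are needed a second time.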

\d
Suppose we want to extract all the digits (0-9) from the variable text then-

Example-

pattern = re.compile(r'\d') # what to extract

matches = pattern.finditer(text) # from where to extract

list(matches)

Output-

[<re.Match object; span=(56, 57), match='1'>,
 <re.Match object; span=(57, 58), match='2'>,
 <re.Match object; span=(58, 59), match='3'>,
 <re.Match object; span=(59, 60), match='4'>,
 <re.Match object; span=(60, 61), match='5'>,
 <re.Match object; span=(61, 62), match='6'>,
 <re.Match object; span=(62, 63), match='7'>,
 <re.Match object; span=(63, 64), match='8'>,
 <re.Match object; span=(64, 65), match='9'>,
 <re.Match object; span=(65, 66), match='0'>,
 <re.Match object; span=(160, 161), match='3'>,
 <re.Match object; span=(161, 162), match='2'>,
 <re.Match object; span=(162, 163), match='1'>,
 <re.Match object; span=(164, 165), match='5'>,
 <re.Match object; span=(165, 166), match='5'>,
 <re.Match object; span=(166, 167), match='5'>,
 <re.Match object; span=(168, 169), match='4'>,
 <re.Match object; span=(169, 170), match='3'>,
 <re.Match object; span=(170, 171), match='2'>,
 <re.Match object; span=(171, 172), match='1'>]

\D
Suppose we want to extract all characters except digits (0-9) from the variable text then-

Example-

pattern = re.compile(r'\D') # what to extract

matches = pattern.finditer(text) # from where to extract

list(matches)

Output-

[<re.Match object; span=(0, 1), match='\n'>,
 <re.Match object; span=(1, 2), match='a'>,
 <re.Match object; span=(2, 3), match='b'>,
 <re.Match object; span=(3, 4), match='c'>,
 <re.Match object; span=(4, 5), match='d'>,
 <re.Match object; span=(5, 6), match='e'>,
 <re.Match object; span=(6, 7), match='f'>,
 <re.Match object; span=(7, 8), match='g'>,.......]

\w
Suppose we want to extract all the word characters (a-z, A-Z, 0-9, _) from the variable text then-

Example-

pattern = re.compile(r'\w') # what to extract

matches = pattern.finditer(text) # from where to extract

list(matches)

Output-

[.....
 <re.Match object; span=(162, 163), match='1'>,
 .
 .
 .
 <re.Match object; span=(223, 224), match='r'>,
 <re.Match object; span=(226, 227), match='T'>,
 <re.Match object; span=(228, 229), match='M'>,
 <re.Match object; span=(229, 230), match='r'>,
 <re.Match object; span=(230, 231), match='_'>,
 <re.Match object; span=(231, 232), match='h'>,
.......]

\W
Suppose we want to extract every character that is not a word character (a-z, A-Z, 0-9, _) from the variable text then-

Example-

pattern = re.compile(r'\W') # what to extract

matches = pattern.finditer(text) # from where to extract

list(matches)

Output-

[.....
 <re.Match object; span=(111, 112), match='\n'>,
 <re.Match object; span=(112, 113), match='.'>,
 <re.Match object; span=(113, 114), match=' '>,
 <re.Match object; span=(114, 115), match='^'>,
 <re.Match object; span=(115, 116), match=' '>,
 <re.Match object; span=(116, 117), match='$'>,
.......]

\s
Suppose we want to extract whitespace (space, tab, newline) from the variable text then-

Example-

pattern = re.compile(r'\s') # what to extract

matches = pattern.finditer(text) # from where to extract

list(matches)

Output-

[.....
 <re.Match object; span=(98, 99), match=' '>,
 <re.Match object; span=(101, 102), match=' '>,
 <re.Match object; span=(111, 112), match='\n'>,
 <re.Match object; span=(113, 114), match=' '>,
 <re.Match object; span=(115, 116), match=' '>,
 <re.Match object; span=(117, 118), match=' '>,
.......]

\S
Suppose we want to extract all characters except whitespace (space, tab, newline) from the variable text then-

Example-

pattern = re.compile(r'\S') # what to extract

matches = pattern.finditer(text) # from where to extract

list(matches)

Output-

[.....
 <re.Match object; span=(155, 156), match='.'>,
 <re.Match object; span=(156, 157), match='c'>,
 <re.Match object; span=(157, 158), match='o'>,
 <re.Match object; span=(158, 159), match='m'>,
 <re.Match object; span=(160, 161), match='3'>,
.......]
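To compare the six shorthand classes side by side, here is a small sketch that runs each of them over one short string. The string sample and the dictionary found are made up for this illustration.

```python
import re

# Hypothetical sample: letters, an underscore, digits, a space, a tab and '!'.
sample = "Ab_1 \t9!"

# Map each shorthand class to the list of characters it finds in `sample`.
found = {cls: re.findall(cls, sample)
         for cls in (r'\d', r'\D', r'\w', r'\W', r'\s', r'\S')}

for cls, chars in found.items():
    print(cls, chars)
```

Each uppercase class returns exactly the characters that its lowercase partner leaves out.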

Raw String
Patterns are usually written as raw strings, for example r'.' to match any character or r'\d' to match any digit.
The r prefix creates a raw string, i.e. Python does not treat the \ character as an escape character inside it, so the backslash reaches the regex engine unchanged. This matters because regex itself uses \ for sequences such as \d and \b.
In case of a normal string-
\n is a new line
\t is a tab

In case of a raw string-
\n stays as the two characters \ and n, not a new line
\t stays as the two characters \ and t, not a tab

Example 1-

s = 'Hello\nFresherbell\t.com'
print(s)

Output-

Hello
Fresherbell	.com

Example 2-
With the r prefix, the backslash sequences are printed exactly as written-

s = r'Hello\nFresherbell\t.com'
print(s)

Output-

Hello\nFresherbell\t.com
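The difference matters for regex patterns too. The sketch below (variable names are made up for illustration) shows that '\b' in a normal Python string is the backspace character, so the word-boundary pattern only works when written as a raw string.

```python
import re

normal = '\n'   # one character: a newline
raw = r'\n'     # two characters: a backslash followed by 'n'
print(len(normal), len(raw))

data = 'cat catherine'

# In a normal string '\b' is a backspace character, so nothing is found.
without_raw = re.findall('\bcat\b', data)

# In a raw string r'\b' reaches the regex engine as a word boundary.
with_raw = re.findall(r'\bcat\b', data)

print(without_raw, with_raw)
```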

\b - Word Boundary

Suppose we want to extract only the standalone word "cat" from the data below-

data = 'cat catherine catholic wildcat copycat uncatchable'

Let's see an example-

pattern = re.compile('cat')

matches = pattern.finditer(data)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(4, 7), match='cat'>
<re.Match object; span=(14, 17), match='cat'>
<re.Match object; span=(27, 30), match='cat'>
<re.Match object; span=(35, 38), match='cat'>
<re.Match object; span=(41, 44), match='cat'>

In the above program, we are getting every "cat", including the ones inside catherine, catholic, wildcat etc. But we only want the standalone word "cat". So, let's see another example-

pattern = re.compile('cat ')

matches = pattern.finditer(data)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 4), match='cat '>
<re.Match object; span=(27, 31), match='cat '>
<re.Match object; span=(35, 39), match='cat '>

In the above program, we now get the standalone word "cat", but the match includes the trailing space, and "cat" at the end of wildcat and copycat is matched too. To solve this issue, we can use a word boundary-

word\b = Right-hand side of the word should not be a word char
\bword = Left-hand side of the word should not be a word char
\bword\b = Both sides of the word should not be a word char.

Example-

pattern = re.compile(r'cat\b')

matches = pattern.finditer(data)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(27, 30), match='cat'>
<re.Match object; span=(35, 38), match='cat'>

Still, the issue is not resolved: we are still getting "cat" from the words wildcat and copycat.

Another example-

pattern = re.compile(r'\bcat')

matches = pattern.finditer(data)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(4, 7), match='cat'>
<re.Match object; span=(14, 17), match='cat'>

Still, the issue is not resolved: we are still getting "cat" from the words catherine and catholic.

Correct Solution-

pattern = re.compile(r'\bcat\b')

matches = pattern.finditer(data)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 3), match='cat'>
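The same idea can be wrapped in a small helper. This is only a sketch: count_whole_word is a hypothetical function name, and re.escape is used so that a word containing special regex characters is still treated literally.

```python
import re

def count_whole_word(word, text):
    """Count occurrences of `word` that stand alone (hypothetical helper)."""
    # \b on both sides: neither neighbour may be a word character.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    return len(pattern.findall(text))

data = 'cat catherine catholic wildcat copycat uncatchable'
print(count_whole_word('cat', data))
```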

\B - Not a Word Boundary

word\B = Right-hand side of the word should be a word char
\Bword = Left-hand side of the word should be a word char
\Bword\B = Both sides of the word should be a word char

data = 'she sells seashells at sea shore'

Example 1-

pattern = re.compile(r's\B')

matches = pattern.finditer(data)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 1), match='s'>
<re.Match object; span=(4, 5), match='s'>
<re.Match object; span=(10, 11), match='s'>
<re.Match object; span=(13, 14), match='s'>
<re.Match object; span=(23, 24), match='s'>
<re.Match object; span=(27, 28), match='s'>

Example 2-

pattern = re.compile(r'\Bs')

matches = pattern.finditer(data)

for match in matches:
    print(match)

Output-

<re.Match object; span=(8, 9), match='s'>
<re.Match object; span=(13, 14), match='s'>
<re.Match object; span=(18, 19), match='s'>

Example 3-

pattern = re.compile(r'\Bs\B')

matches = pattern.finditer(data)

for match in matches:
    print(match)

Output-

<re.Match object; span=(13, 14), match='s'>
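As a further sketch, \B can be combined with \w* to pick out whole words that contain a piece of text strictly in the middle. The example reuses the "cat" data from the word boundary section. The variable name inner is made up for this illustration.

```python
import re

data = 'cat catherine catholic wildcat copycat uncatchable'

# \Bcat\B: "cat" must have a word character on both sides.
# \w* on each side extends the match to the whole surrounding word.
inner = re.findall(r'\w*\Bcat\B\w*', data)
print(inner)
```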

^ - Beginning of a String
To check whether a string begins with a particular word or not-
Example-

txt = "Welcome to the fresherbell.com"
x = re.search("^Welcome", txt)

if x:
    print("Beginning with Welcome")
else:
    print("Not Beginning with Welcome")

Output-

Beginning with Welcome

$ - End of a String
To check whether a string ends with a particular word or not-
Example-

txt = "Welcome to the fresherbell.com"
x = re.search("com$", txt)

if x:
    print("Ending with com")
else:
    print("Not Ending with com")

Output-

Ending with com
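To see that ^ and $ really anchor the pattern to the two ends of the string, the sketch below runs both checks on three made-up strings. The list candidates and the variable results are assumptions for this illustration.

```python
import re

# Hypothetical test strings: a full match, "Welcome" in the middle,
# and extra text after "com".
candidates = [
    "Welcome to the fresherbell.com",
    "You are Welcome to the fresherbell.com",
    "Welcome to the fresherbell.com today",
]

results = []
for txt in candidates:
    starts = bool(re.search(r"^Welcome", txt))  # True only at the very start
    ends = bool(re.search(r"com$", txt))        # True only at the very end
    results.append((starts, ends))
    print(starts, ends, txt)
```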

----------------------------------------------------------------------------------------------

text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ\s
321-555-4321
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
support@gmail.com
321-555-4321
123.555.1234
123*555*-1234
123.555.1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
Mr_hello
'''

# Q1 - write a regex to find all the standalone 3 digit numbers

pattern = re.compile(r'\b\d\d\d\b')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(57, 60), match='321'>
<re.Match object; span=(61, 64), match='555'>
<re.Match object; span=(172, 175), match='321'>
<re.Match object; span=(176, 179), match='555'>
<re.Match object; span=(185, 188), match='123'>
<re.Match object; span=(189, 192), match='555'>
<re.Match object; span=(198, 201), match='123'>
<re.Match object; span=(202, 205), match='555'>
<re.Match object; span=(212, 215), match='123'>
<re.Match object; span=(216, 219), match='555'>
<re.Match object; span=(225, 228), match='800'>
<re.Match object; span=(229, 232), match='555'>
<re.Match object; span=(238, 241), match='900'>
<re.Match object; span=(242, 245), match='555'>

#Q2 extract a valid phone number = nnn.nnn.nnnn

pattern = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(185, 197), match='123.555.1234'>
<re.Match object; span=(212, 224), match='123.555.1234'>

[ ] - Matches any single character in brackets.
Example 1-

txt = "Python Mython ython. python Rython"
pattern = re.compile('[Pp]ython')

matches = pattern.finditer(txt)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 6), match='Python'>
<re.Match object; span=(21, 27), match='python'>

Example 2-

txt = "Python Mython ython. python Rython"
pattern = re.compile('[on]')

matches = pattern.finditer(txt)

for match in matches:
    print(match)

Output-

<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(5, 6), match='n'>
<re.Match object; span=(11, 12), match='o'>
<re.Match object; span=(12, 13), match='n'>
<re.Match object; span=(17, 18), match='o'>
<re.Match object; span=(18, 19), match='n'>
<re.Match object; span=(25, 26), match='o'>
<re.Match object; span=(26, 27), match='n'>
<re.Match object; span=(32, 33), match='o'>
<re.Match object; span=(33, 34), match='n'>

#Q3 extract a valid phone number = nnn.nnn.nnnn / nnn-nnn-nnnn

Example-

pattern = re.compile(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(57, 69), match='321-555-4321'>
<re.Match object; span=(172, 184), match='321-555-4321'>
<re.Match object; span=(185, 197), match='123.555.1234'>
<re.Match object; span=(212, 224), match='123.555.1234'>
<re.Match object; span=(225, 237), match='800-555-1234'>
<re.Match object; span=(238, 250), match='900-555-1234'>

Example 3 - Range
Without using range

txt = "1th 2th 3th 4th 5th 6th ath bth cth dth eth"
pattern = re.compile('[123456abcde]th')

matches = pattern.finditer(txt)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 3), match='1th'>
<re.Match object; span=(4, 7), match='2th'>
<re.Match object; span=(8, 11), match='3th'>
<re.Match object; span=(12, 15), match='4th'>
<re.Match object; span=(16, 19), match='5th'>
<re.Match object; span=(20, 23), match='6th'>
<re.Match object; span=(24, 27), match='ath'>
<re.Match object; span=(28, 31), match='bth'>
<re.Match object; span=(32, 35), match='cth'>
<re.Match object; span=(36, 39), match='dth'>
<re.Match object; span=(40, 43), match='eth'>

With Range

txt = "1th 2th 3th 4th 5th 6th ath bth cth dth eth"
pattern = re.compile('[1-6a-e]th')

matches = pattern.finditer(txt)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 3), match='1th'>
<re.Match object; span=(4, 7), match='2th'>
<re.Match object; span=(8, 11), match='3th'>
<re.Match object; span=(12, 15), match='4th'>
<re.Match object; span=(16, 19), match='5th'>
<re.Match object; span=(20, 23), match='6th'>
<re.Match object; span=(24, 27), match='ath'>
<re.Match object; span=(28, 31), match='bth'>
<re.Match object; span=(32, 35), match='cth'>
<re.Match object; span=(36, 39), match='dth'>
<re.Match object; span=(40, 43), match='eth'>

[^ ] - Matches any single character not in brackets
Example-

txt = "Python Mython ython. python Rython"
pattern = re.compile('[^Pp]ython')

matches = pattern.finditer(txt)

for match in matches:
    print(match)

Output-

<re.Match object; span=(7, 13), match='Mython'>
<re.Match object; span=(13, 19), match=' ython'>
<re.Match object; span=(28, 34), match='Rython'>

Example 2-

txt = "Python Mython ython. python Rython"
pattern = re.compile('[^ython]')

matches = pattern.finditer(txt)

for match in matches:
    print(match)

Output-

<re.Match object; span=(0, 1), match='P'>
<re.Match object; span=(6, 7), match=' '>
<re.Match object; span=(7, 8), match='M'>
<re.Match object; span=(13, 14), match=' '>
<re.Match object; span=(19, 20), match='.'>
<re.Match object; span=(20, 21), match=' '>
<re.Match object; span=(21, 22), match='p'>
<re.Match object; span=(27, 28), match=' '>
<re.Match object; span=(28, 29), match='R'>

Example 3 - Range

Using range with ^

txt = "1th 2th 3th 4th 5th 6th ath bth cth dth eth"
pattern = re.compile('[^1-5a-d]th')

matches = pattern.finditer(txt)

for match in matches:
    print(match)

Output-

<re.Match object; span=(20, 23), match='6th'>
<re.Match object; span=(40, 43), match='eth'>

{n} - Matches exactly n occurrences of the preceding expression.

Instead of using \d\d\d (i.e. \d 3 times), we can use \d{3}
Example-

pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(57, 69), match='321-555-4321'>
<re.Match object; span=(172, 184), match='321-555-4321'>
<re.Match object; span=(185, 197), match='123.555.1234'>
<re.Match object; span=(212, 224), match='123.555.1234'>
<re.Match object; span=(225, 237), match='800-555-1234'>
<re.Match object; span=(238, 250), match='900-555-1234'>
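A quick way to confirm that the two spellings are equivalent is to run both on the same input and compare the results. The string sample below is made up for this sketch.

```python
import re

# Hypothetical sample: two valid numbers and one with '*' separators.
sample = "321-555-4321 123.555.1234 123*555*1234"

long_form = re.findall(r'\d\d\d[.-]\d\d\d[.-]\d\d\d\d', sample)
short_form = re.findall(r'\d{3}[.-]\d{3}[.-]\d{4}', sample)

# Both patterns find the same numbers; the '*' version is skipped.
print(long_form == short_form, short_form)
```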

{n,} - Matches n or more occurrences of the preceding expression.
If we want to extract 2 or more digits, then we can use \d{2,}

text_to_search = '''
321-555-4321
321-555-432165
321-555-43
123.555.1234
123*555*-1234
123.555.1234
800-555-1234
900-555-1234
'''

Example-

pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{2,}')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(1, 13), match='321-555-4321'>
<re.Match object; span=(14, 28), match='321-555-432165'>
<re.Match object; span=(29, 39), match='321-555-43'>
<re.Match object; span=(40, 52), match='123.555.1234'>
<re.Match object; span=(67, 79), match='123.555.1234'>
<re.Match object; span=(80, 92), match='800-555-1234'>
<re.Match object; span=(93, 105), match='900-555-1234'>

{n,m} - Matches at least n and at most m occurrences of the preceding expression.
If we want to extract between 3 and 6 digits, then we can use \d{3,6}
Example-

pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{3,6}')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(1, 13), match='321-555-4321'>
<re.Match object; span=(14, 28), match='321-555-432165'>
<re.Match object; span=(40, 52), match='123.555.1234'>
<re.Match object; span=(67, 79), match='123.555.1234'>
<re.Match object; span=(80, 92), match='800-555-1234'>
<re.Match object; span=(93, 105), match='900-555-1234'>
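One thing to keep in mind: {n,m} limits how many characters the pattern consumes, not how long the run of digits in the text is. The sketch below (with a made-up string numbers) shows that \d{3,4} happily matches the first 4 digits of a longer run, and that adding \b at the end is one possible way to reject such numbers.

```python
import re

# Hypothetical sample: a valid number, one with too many digits,
# and one with too few.
numbers = "321-555-4321 321-555-432165 321-555-43"

# Without a boundary, the first 4 digits of "432165" are accepted.
loose = re.findall(r'\d{3}[.-]\d{3}[.-]\d{3,4}', numbers)

# With \b the last group must end where the run of digits ends.
strict = re.findall(r'\d{3}[.-]\d{3}[.-]\d{3,4}\b', numbers)

print(loose)
print(strict)
```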

Escape Character (\char)
To match a special regex character literally-
Special Regex Characters: These characters have special meaning in regex: ., +, *, ?, ^, $, (, ), [, ], {, }, |, \.

To match a character that has special meaning in regex, prefix it with a backslash (\). E.g., \. matches "."; \+ matches "+"; and \( matches "(".
To match "\" (backslash) itself, use \\.
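When the text to match contains several special characters, escaping each one by hand is error-prone. The standard re.escape function adds the backslashes for us. The sketch below uses made-up strings (literal and text) to compare searching with and without escaping.

```python
import re

# Hypothetical literal text that contains the special characters + ( ) .
literal = "1+1=2 (approx.)"
text = "We know that 1+1=2 (approx.) holds"

# re.escape puts a backslash before every special regex character.
escaped = re.escape(literal)

found = re.search(escaped, text)      # treats the text literally
not_found = re.search(literal, text)  # + ( ) . keep their special meaning

print(found.group())
print(not_found)
```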

Without escape character

text_to_search = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
Mr_hello
'''

Example -

pattern = re.compile(r'Mr.')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(1, 4), match='Mr.'>
<re.Match object; span=(13, 16), match='Mr '>
<re.Match object; span=(31, 34), match='Mrs'>
<re.Match object; span=(45, 48), match='Mr.'>
<re.Match object; span=(51, 54), match='Mr_'>

With escape character-

pattern = re.compile(r'Mr\.')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(1, 4), match='Mr.'>
<re.Match object; span=(45, 48), match='Mr.'>

* - Matches 0 or more occurrences of the preceding expression.
Without *

pattern = re.compile(r'Mr\. [A-Z][a-z]')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(1, 7), match='Mr. Sc'>

With *

pattern = re.compile(r'Mr\. [A-Z][a-z]*')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(45, 50), match='Mr. T'>

+ - Matches 1 or more occurrences of the preceding expression.
Example-

pattern = re.compile(r'Mr\. [A-Z][a-z]+')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(1, 12), match='Mr. Schafer'>

? - Matches 0 or 1 occurrence of the preceding expression.
Example-

pattern = re.compile(r'Mr\.? [A-Z][a-z]*')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(13, 21), match='Mr Smith'>
<re.Match object; span=(45, 50), match='Mr. T'>

a|b - Matches either a or b.
Example-

pattern = re.compile(r'M(r|s|rs)\. [A-Z][a-z]*')             
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Output-

<re.Match object; span=(1, 12), match='Mr. Schafer'>
<re.Match object; span=(31, 44), match='Mrs. Robinson'>
<re.Match object; span=(45, 50), match='Mr. T'>
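The pattern above requires the dot, so "Mr Smith" and "Ms Davis" are left out. As a sketch, the alternation can be combined with the ? quantifier to make the dot optional. The list name names is made up for this illustration.

```python
import re

text_to_search = '''
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
Mr_hello
'''

# (r|s|rs) picks the title, \.? makes the dot optional.
pattern = re.compile(r'M(r|s|rs)\.? [A-Z][a-z]*')

names = [match.group() for match in pattern.finditer(text_to_search)]
print(names)
```

"Mr_hello" is still skipped, because the pattern needs a space after the title.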

( ) - Groups regular expressions and remembers matched text.
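A minimal sketch of how a group "remembers" text: each pair of parentheses captures the part of the match it encloses, and the match object returns it through group(1), group(2) and so on. The sample sentence and the variable names title and name are made up for this illustration.

```python
import re

# Group 1 captures the title, group 2 captures the name.
pattern = re.compile(r'(Mr|Ms|Mrs)\.? ([A-Z][a-z]*)')

match = pattern.search("Dear Mrs. Robinson, welcome")

whole = match.group(0)   # the entire match
title = match.group(1)   # text remembered by the first group
name = match.group(2)    # text remembered by the second group

print(whole)
print(title, name)
print(match.groups())    # all captured groups as a tuple
```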
