Python regex

what is regex?

  • Regular expressions (regex) are a tool for matching patterns in text.
  • Python has a built-in module called re for working with regular expressions.

find a pattern using python regex

  • Let's look for a ZIP code in the given text.
  • ZIP code characteristics
    • It is exactly 6 characters long, and every character is a digit.

import re

address1 = "43 Diamond Harbour Road, Alipore, Kolkata, 700027"
address2 = "781, Golden towers, Zip:500001, Hyderabad"

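# The named group zip_code captures six consecutive digits. Because the
# pattern contains exactly one group, findall() returns that group's text.
zip_code_pattern = r'.*(?P<zip_code>[0-9]{6}).*'
output = re.findall(zip_code_pattern, address1)
print(output)
# output: ['700027']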
output = re.findall(zip_code_pattern, address2)
print(output)
# output: ['500001']

character sets

Pattern Meaning
\w Match a single word character a-z, A-Z, 0-9, and underscore (_)
\d Match a single digit 0-9
\s Match whitespace including \t, \n, and \r and space character
. Match any character except the newline
\W Match a character except for a word character
\D Match a character except for a digit
\S Match a single character except for a whitespace character
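
As a quick illustration of the character classes above, here is a minimal sketch. The sample strings and variable names are made up for this example and are not part of the original notes.

```python
import re

# Hypothetical sample text containing a space, a tab, and another space
text = "Order_42 shipped\ton 2024-01-15"

# \d matches one digit at a time
digits = re.findall(r"\d", "a1b22")
print(digits)      # ['1', '2', '2']

# \D+ matches runs of characters that are not digits
non_digits = re.findall(r"\D+", "a1b22")
print(non_digits)  # ['a', 'b']

# \w+ matches runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)
print(words)       # ['Order_42', 'shipped', 'on', '2024', '01', '15']

# \s matches each whitespace character, including the tab
spaces = re.findall(r"\s", text)
print(spaces)      # [' ', '\t', ' ']

# . matches any character except the newline
dots = re.findall(r".", "a\nb")
print(dots)        # ['a', 'b']
```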

anchors

Pattern Meaning
^ Match at the beginning of a string
$ Match at the end of a string
\b Match a position defined as a word boundary
\B Match a position that is not a word boundary
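
The anchors above match positions rather than characters. A small sketch, using a made-up sample string, shows the difference:

```python
import re

# Hypothetical sample text with "cat" as a word, a suffix, and a prefix
text = "cat concat category"

# ^ anchors the match to the start of the string
starts = re.search(r"^cat", text)
print(starts.span())   # (0, 3)

# $ anchors the match to the end; the string ends with "category"
ends = re.search(r"cat$", text)
print(ends)            # None

# \b on both sides matches "cat" only as a whole word
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)     # ['cat']

# \B before "cat" requires a word character just before it ("concat")
inside = [m.start() for m in re.finditer(r"\Bcat", text)]
print(inside)          # [7]
```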

quantifiers

Quantifiers (Greedy) Non-greedy Quantifiers (Lazy) Meaning
* *? Match its preceding element zero or more times.
+ +? Match its preceding element one or more times.
? ?? Match its preceding element zero or one time.
{n} {n}? Match its preceding element exactly n times.
{n,} {n,}? Match its preceding element at least n times.
{n,m} {n,m}? Match its preceding element from n to m times. (Write the braces without spaces; Python treats {n , m} as literal text.)
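
To make the greedy versus lazy distinction concrete, here is a short sketch. The sample strings are invented for illustration.

```python
import re

# Hypothetical sample markup with two tagged items
html = "<b>one</b><b>two</b>"

# Greedy .* grabs as much as possible, spanning both items
greedy = re.findall(r"<b>.*</b>", html)
print(greedy)    # ['<b>one</b><b>two</b>']

# Lazy .*? stops at the first closing tag it can reach
lazy = re.findall(r"<b>.*?</b>", html)
print(lazy)      # ['<b>one</b>', '<b>two</b>']

numbers = "1 22 333 4444"

# {2,3} matches two or three digits; "4444" yields only its first three
two_to_three = re.findall(r"\d{2,3}", numbers)
print(two_to_three)   # ['22', '333', '444']

# {2,} matches two or more digits
two_or_more = re.findall(r"\d{2,}", numbers)
print(two_or_more)    # ['22', '333', '4444']

# ? makes the preceding element optional
optional = re.findall(r"colou?r", "color colour")
print(optional)       # ['color', 'colour']
```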

sets & ranges

Pattern Meaning
[XYZ] Match any of three elements X, Y, and Z
[X-Y] Match a range from X to Y
[^XYZ] Match any single element except X, Y, and Z
[^X-Y] Match any single element outside the range from X to Y
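
A brief sketch of sets and ranges in use follows. The sample text is made up, and the negated form places the caret inside the brackets.

```python
import re

# Hypothetical sample text
text = "Room B12, floor 3"

# A set matches any one of the listed characters
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)      # ['e', 'e']

# A range matches any one character between its endpoints
capitals = re.findall(r"[A-Z]", text)
print(capitals)    # ['R', 'B']

# A range combined with a quantifier matches whole numbers
numbers = re.findall(r"[0-9]+", text)
print(numbers)     # ['12', '3']

# A caret inside the brackets negates the set:
# everything that is neither a lowercase letter nor a space
others = re.findall(r"[^a-z ]", text)
print(others)      # ['R', 'B', '1', '2', ',', '3']
```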

capturing groups

Pattern Meaning
(X) Capture the X in the group
(?P<name>X) Capture the X and assign it the name
\N Reference the capturing group #N (inside a pattern or a replacement string)
\g<N> Reference the capturing group #N (alternative syntax, valid only in the replacement string of sub())
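
The following sketch shows numbered groups, named groups, and both reference styles. The date string and the group names (year, month, day) are chosen for this example only.

```python
import re

# Hypothetical sample date
date = "2024-01-15"

# Named groups can be read by index or by name
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", date)
print(m.group(1))         # 2024
print(m.group("month"))   # 01
print(m.groupdict())      # {'year': '2024', 'month': '01', 'day': '15'}

# \1 inside a pattern refers back to what group 1 matched,
# which makes it possible to find a repeated word
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")
print(repeated.group())   # is is

# \g<N> inside a replacement string refers to group N
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\g<3>/\g<2>/\g<1>", date)
print(reordered)          # 15/01/2024
```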

alternation

Pattern Meaning
X|Y Match either X or Y (no spaces around |, since a space in a pattern is matched literally)
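
Alternation has the lowest precedence of the regex operators, so grouping often matters. A minimal sketch with made-up sample strings:

```python
import re

# Either alternative can match at each position
pets = re.findall(r"cat|dog", "cat dog bird")
print(pets)        # ['cat', 'dog']

# A non-capturing group limits the alternation to one character
grouped = re.findall(r"gr(?:a|e)y", "gray grey")
print(grouped)     # ['gray', 'grey']

# Without the group, the alternatives are "gra" and "ey"
ungrouped = re.findall(r"gra|ey", "gray grey")
print(ungrouped)   # ['gra', 'ey']
```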

look around

Pattern Meaning
X(?=Y) Match X but only if it is followed by Y
X(?!Y) Match X but only if it is NOT followed by Y
(?<=Y)X Match X if there is Y before it
(?<!Y)X Match X if there is NO Y before it
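
Lookarounds check the surrounding text without including it in the match. The sketch below uses an invented price string. The \b anchors are an addition of this example so that a number is never matched partway through.

```python
import re

# Hypothetical sample text mixing dollar amounts and a euro amount
text = "price: $100, cost: 200 EUR, tax: $30"

# Lookahead: numbers followed by " EUR"
followed = re.findall(r"\d+(?= EUR)", text)
print(followed)        # ['200']

# Negative lookahead: whole numbers NOT followed by " EUR"
not_followed = re.findall(r"\b\d+\b(?! EUR)", text)
print(not_followed)    # ['100', '30']

# Lookbehind: numbers preceded by a dollar sign
preceded = re.findall(r"(?<=\$)\d+", text)
print(preceded)        # ['100', '30']

# Negative lookbehind: whole numbers NOT preceded by a dollar sign
not_preceded = re.findall(r"(?<!\$)\b\d+", text)
print(not_preceded)    # ['200']
```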

regex functions

Function Description
findall() Return a list of all matches (an empty list if there are none)
finditer() Return an iterator yielding all non-overlapping matches
search() Return the first match anywhere in the string or None
fullmatch() Return a Match object if the whole string matches a pattern or None
match() Return the match at the beginning of a string or None
sub() Return a string with the matches replaced by a replacement
split() Split a string at the occurrences of matches
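
Here is a compact sketch that calls each of the functions above on the same made-up string, so their return values can be compared side by side.

```python
import re

# Hypothetical sample text
text = "a1 b22 c333"

all_numbers = re.findall(r"\d+", text)
print(all_numbers)    # ['1', '22', '333']

no_matches = re.findall(r"x", text)
print(no_matches)     # [] (an empty list, not None)

# finditer() yields Match objects, which also carry positions
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(positions)      # [('1', 1), ('22', 4), ('333', 8)]

first = re.search(r"\d+", text)
print(first.group())  # 1

# match() only looks at the beginning of the string, which is "a"
at_start = re.match(r"\d+", text)
print(at_start)       # None

# fullmatch() requires the entire string to match
whole = re.fullmatch(r"[a-z]\d", "a1")
partial = re.fullmatch(r"[a-z]\d", text)
print(whole is not None, partial)   # True None

replaced = re.sub(r"\d+", "#", text)
print(replaced)       # a# b# c#

pieces = re.split(r"\s+", text)
print(pieces)         # ['a1', 'b22', 'c333']
```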

regex flags

Flag Alias Inline Flag Meaning
re.ASCII re.A (?a) The re.ASCII is relevant to Unicode (str) patterns only and is ignored for byte patterns. It makes the \w, \W, \b, \B, \d, \D, \s, and \S perform ASCII-only matching instead of full Unicode matching.
re.DEBUG N/A N/A The re.DEBUG shows the debug information of the compiled pattern.
re.IGNORECASE re.I (?i) Perform case-insensitive matching. It means that the [A-Z] will also match lowercase letters.
re.LOCALE re.L (?L) The re.LOCALE is relevant only to byte patterns. It makes the \w, \W, \b, \B and case-insensitive matching dependent on the current locale. The re.LOCALE is not compatible with the re.ASCII flag.
re.MULTILINE re.M (?m) The re.MULTILINE makes ^ match at the beginning of a string and at the beginning of each line, and $ match at the end of a string and at the end of each line.
re.DOTALL re.S (?s) By default, the dot (.) matches any character except a newline. The re.DOTALL makes the dot (.) match all characters including a newline.
re.VERBOSE re.X ?x The re.VERBOSE flag allows you to organize a pattern into logical sections visually and add comments.
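
The effect of the most common flags can be seen in a short sketch. The sample strings and variable names are assumptions made for this example. The VERBOSE pattern reuses the six-digit ZIP code idea from earlier and anchors it to the end of the string.

```python
import re

# Hypothetical two-line sample text
text = "first line\nSecond line"

# Without MULTILINE, ^ matches only at the start of the string
default_start = re.findall(r"^\w+", text)
multiline_start = re.findall(r"^\w+", text, re.MULTILINE)
print(default_start, multiline_start)   # ['first'] ['first', 'Second']

# IGNORECASE as a flag argument and as an inline flag
ignore_case = re.findall(r"second", text, re.IGNORECASE)
inline_ignore_case = re.findall(r"(?i)second", text)
print(ignore_case, inline_ignore_case)  # ['Second'] ['Second']

# Without DOTALL, the dot cannot cross the newline
no_dotall = re.search(r"first.*Second", text)
with_dotall = re.search(r"first.*Second", text, re.DOTALL)
print(no_dotall, with_dotall is not None)   # None True

# ASCII restricts \w to a-z, A-Z, 0-9, and underscore
unicode_word = re.findall(r"\w+", "caf\u00e9")
ascii_word = re.findall(r"\w+", "caf\u00e9", re.ASCII)
print(unicode_word, ascii_word)         # ['café'] ['caf']

# VERBOSE ignores whitespace in the pattern and allows comments
zip_pattern = re.compile(r"""
    (?P<zip_code>\d{6})   # six consecutive digits
    $                     # at the end of the string
""", re.VERBOSE)
zip_code = zip_pattern.search("Kolkata, 700027").group("zip_code")
print(zip_code)                         # 700027
```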
