Python regex
what is regex?
- Regular expressions (regex) are a tool for matching patterns in text.
- Python has a built-in module called re for working with regular expressions.
find a pattern using python regex
- Let's look for a ZIP code in the given text.
- ZIP code characteristics
- It is exactly 6 characters long, and every character is a digit.
import re
address1 = "43 Diamond Harbour Road, Alipore, Kolkata, 700027"
address2 = "781, Golden towers, Zip:500001, Hyderabad"
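# The named group zip_code captures six consecutive digits. Because the
# leading .* is greedy, findall returns the last run of six digits on a line.
zip_code_pattern = r'.*(?P<zip_code>[0-9]{6}).*'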
output = re.findall(zip_code_pattern, address1)
print(output)
# output: ['700027']
output = re.findall(zip_code_pattern, address2)
print(output)
# output: ['500001']
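The pattern above defines a named group, but findall() returns only the captured text. As a minimal sketch, the same ZIP code can be read by name from a Match object. The \b word boundaries here are an added assumption (not part of the original pattern) so that a longer run of digits is not mistaken for a ZIP code.

```python
import re

address = "781, Golden towers, Zip:500001, Hyderabad"

# \b on each side keeps the match from landing inside a longer run of digits.
zip_code_pattern = re.compile(r'\b(?P<zip_code>[0-9]{6})\b')

match = zip_code_pattern.search(address)
# search() returns None when nothing matches, so guard before reading the group.
zip_code = match.group('zip_code') if match else None
print(zip_code)  # 500001
```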
character sets
Pattern | Meaning |
\w | Match a single word character: a-z, A-Z, 0-9, or underscore (_); for str patterns, also Unicode letters and digits unless re.ASCII is set |
\d | Match a single digit 0-9 |
\s | Match a single whitespace character, including space, \t, \n, and \r |
. | Match any single character except a newline |
\W | Match a single character that is not a word character |
\D | Match a single character that is not a digit |
\S | Match a single character that is not a whitespace character |
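A short illustration of the character classes in the table above. The sample string and variable names are made up for this example.

```python
import re

sample = "Order_42 shipped\ton 2024-01-15"

# \w+ groups consecutive word characters (letters, digits, underscore).
words = re.findall(r'\w+', sample)
print(words)        # ['Order_42', 'shipped', 'on', '2024', '01', '15']

# \S+ groups consecutive non-whitespace characters, so the date stays whole.
chunks = re.findall(r'\S+', sample)
print(chunks)       # ['Order_42', 'shipped', 'on', '2024-01-15']

# \D matches every non-digit, so replacing it with '' keeps only the digits.
digits_only = re.sub(r'\D', '', "2024-01-15")
print(digits_only)  # 20240115
```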
anchors
Pattern | Meaning |
^ | Match at the beginning of a string |
$ | Match at the end of a string (or just before a trailing newline) |
\b | Match a position defined as a word boundary |
\B | Match a position that is not a word boundary |
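The anchors above match positions rather than characters. The sketch below, using an illustrative sample string, shows how each anchor narrows the matches for the same literal text.

```python
import re

text = "cat concatenate cat"

anywhere = re.findall(r'cat', text)           # every occurrence, even inside words
whole_words = re.findall(r'\bcat\b', text)    # only standalone words
inside_words = re.findall(r'\Bcat\B', text)   # only occurrences buried in a word
at_start = re.findall(r'^cat', text)          # only at the start of the string
at_end = re.findall(r'cat$', text)            # only at the end of the string

print(len(anywhere), len(whole_words), len(inside_words))  # 3 2 1
print(at_start, at_end)                                    # ['cat'] ['cat']
```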
quantifiers
Quantifiers (Greedy) | Non-greedy Quantifiers (Lazy) | Meaning |
* | *? | Match its preceding element zero or more times. |
+ | +? | Match its preceding element one or more times. |
? | ?? | Match its preceding element zero or one time. |
{n} | {n}? | Match its preceding element exactly n times. |
{n,} | {n,}? | Match its preceding element at least n times. |
{n,m} | {n,m}? | Match its preceding element from n to m times. Write the braces without spaces; Python treats {n , m} as literal text. |
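The difference between greedy and lazy quantifiers is easiest to see side by side. This is a small sketch with an invented sample string: the greedy form takes as much text as it can, while the lazy form stops at the first point where the rest of the pattern can match.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r'<.+>', html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the first '>' it can reach.
lazy = re.findall(r'<.+?>', html)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# {n,m}: between 2 and 4 digits, taking as many as possible.
runs = re.findall(r'\d{2,4}', "1 22 333 4444 55555")
print(runs)    # ['22', '333', '4444', '5555']
```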
sets & ranges
Pattern | Meaning |
[XYZ] | Match any of three elements X, Y, and Z |
[X-Y] | Match a range from X to Y |
[^XYZ] | Match any single element except X, Y, and Z |
[^X-Y] | Match any single element outside the range from X to Y |
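A brief sketch of sets, ranges, and negated sets, using an invented sample string. Note that the ^ goes inside the brackets to negate a set.

```python
import re

text = "gray grey griy gr8y"

either_vowel = re.findall(r'gr[ae]y', text)    # a or e only
any_lowercase = re.findall(r'gr[a-z]y', text)  # any lowercase letter
not_lowercase = re.findall(r'gr[^a-z]y', text) # anything except a lowercase letter

print(either_vowel)   # ['gray', 'grey']
print(any_lowercase)  # ['gray', 'grey', 'griy']
print(not_lowercase)  # ['gr8y']
```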
capturing groups
Pattern | Meaning |
(X) | Capture the X in the group |
(?P<name>X) | Capture the X and assign it the name |
\N | Reference the capturing group #N (for example, \1) |
\g<N> | Reference the capturing group #N in a replacement string, such as the one passed to sub() (alternative syntax) |
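The group syntax above can be sketched as follows. The date string, group names, and sample sentence are illustrative only.

```python
import re

date = "2024-01-15"

# Numbered and named access to the same capturing groups.
m = re.fullmatch(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', date)
year = m.group(1)         # by position
month = m.group('month')  # by name
print(year, month)        # 2024 01

# \1 inside the pattern refers back to whatever group 1 matched,
# which makes it possible to find a word that is repeated.
repeated = re.search(r'\b(\w+) \1\b', "this is is a test").group(1)
print(repeated)           # is

# \g<N> in the replacement string reorders the captured groups.
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\g<3>/\g<2>/\g<1>', date)
print(reordered)          # 15/01/2024
```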
alternation
Pattern | Meaning |
X|Y | Match either X or Y (write it without spaces; a space around | becomes part of the alternative) |
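A small sketch of alternation with invented sample strings. It also shows why whitespace around the | matters: a space in the pattern is matched literally.

```python
import re

pets = re.findall(r'cat|dog', "cat, dog, bird")
print(pets)      # ['cat', 'dog']

# A non-capturing group (?:...) limits how far the alternation reaches.
spellings = re.findall(r'gr(?:a|e)y', "gray grey")
print(spellings) # ['gray', 'grey']

# With spaces, the alternatives are 'cat ' and ' dog'. The first match
# consumes the only space, so ' dog' can no longer match.
spaced = re.findall(r'cat | dog', "cat dog")
print(spaced)    # ['cat ']
```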
look around
Pattern | Meaning |
X(?=Y) | Match X but only if it is followed by Y |
X(?!Y) | Match X but only if it is NOT followed by Y |
(?<=Y)X | Match X if there is Y before it |
(?<!Y)X | Match X if there is NO Y before it |
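Lookarounds check the surrounding text without including it in the match. The sketch below uses invented sample strings. The extra \d in the negative cases is a deliberate addition: without it, the engine can backtrack to a shorter run of digits and still report a match.

```python
import re

weights = "30kg 40lb 50kg"
amounts = "$30 40 $50"

# Lookahead: digits followed by 'kg' ('kg' itself is not part of the match).
in_kg = re.findall(r'\d+(?=kg)', weights)
print(in_kg)       # ['30', '50']

# Negative lookahead: digits NOT followed by 'kg' (or by another digit).
not_kg = re.findall(r'\d+(?!\d|kg)', weights)
print(not_kg)      # ['40']

# Lookbehind: digits preceded by '$'.
dollars = re.findall(r'(?<=\$)\d+', amounts)
print(dollars)     # ['30', '50']

# Negative lookbehind: digits NOT preceded by '$' (or by another digit).
not_dollars = re.findall(r'(?<![$\d])\d+', amounts)
print(not_dollars) # ['40']
```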
regex functions
Function | Description |
findall() | Return a list of all matches (an empty list if there are none) |
finditer() | Return an iterator yielding all non-overlapping matches |
search() | Return the first match anywhere in the string, or None |
fullmatch() | Return a Match object if the whole string matches a pattern, or None |
match() | Return the match at the beginning of a string or None |
sub() | Return a string with the matches replaced by a replacement |
split() | Split a string at the occurrences of matches |
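The functions in the table can be compared on a single input. This is a minimal sketch with an invented sample string, showing what each call returns, including the no-match cases.

```python
import re

text = "a1b22c333"

all_numbers = re.findall(r'\d+', text)                # list of strings
spans = [m.span() for m in re.finditer(r'\d+', text)] # Match objects carry positions
first = re.search(r'\d+', text).group()               # first match anywhere
at_start = re.match(r'\d+', text)                     # None: text starts with 'a'
whole = re.fullmatch(r'[a-z0-9]+', text)              # Match: entire string fits
masked = re.sub(r'\d+', '#', text)                    # replace every match
parts = re.split(r'\d+', text)                        # split at every match
no_hits = re.findall(r'x', text)                      # empty list, not None

print(all_numbers)  # ['1', '22', '333']
print(spans)        # [(1, 2), (3, 5), (6, 9)]
print(first)        # 1
print(at_start)     # None
print(masked)       # a#b#c#
print(parts)        # ['a', 'b', 'c', '']
print(no_hits)      # []
```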
regex flags
Flag | Alias | Inline Flag | Meaning |
re.ASCII | re.A | (?a) | The re.ASCII flag is relevant to Unicode (str) patterns only and is ignored for byte patterns. It makes \w, \W, \b, \B, \d, \D, \s, and \S perform ASCII-only matching instead of full Unicode matching. |
re.DEBUG | N/A | N/A | The re.DEBUG flag shows debug information about the compiled pattern. |
re.IGNORECASE | re.I | (?i) | Perform case-insensitive matching, so [A-Z] also matches lowercase letters. |
re.LOCALE | re.L | (?L) | The re.LOCALE flag is relevant only to byte patterns. It makes \w, \W, \b, \B, and case-insensitive matching dependent on the current locale. The re.LOCALE flag is not compatible with the re.ASCII flag. |
re.MULTILINE | re.M | (?m) | The re.MULTILINE flag makes ^ match at the beginning of the string and at the beginning of each line, and $ match at the end of the string and at the end of each line. |
re.DOTALL | re.S | (?s) | By default, the dot (.) matches any character except a newline. The re.DOTALL flag makes the dot (.) match every character, including a newline. |
re.VERBOSE | re.X | (?x) | The re.VERBOSE flag allows you to organize a pattern into logical sections visually and add comments. |
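A short sketch of how the most common flags change matching. The sample text and variable names are invented for illustration; flags can be passed as an argument or written inline at the start of the pattern.

```python
import re

text = "first line\nSecond LINE"

# IGNORECASE, as an argument and as an inline flag.
any_case = re.findall(r'line', text, flags=re.IGNORECASE)
any_case_inline = re.findall(r'(?i)line', text)
print(any_case)        # ['line', 'LINE']

# MULTILINE lets ^ match at the start of every line.
first_words = re.findall(r'^\w+', text, flags=re.MULTILINE)
print(first_words)     # ['first', 'Second']

# DOTALL lets . cross the newline between the two lines.
without_dotall = re.search(r'first.*Second', text)
with_dotall = re.search(r'first.*Second', text, flags=re.DOTALL)
print(without_dotall is None, with_dotall is not None)  # True True

# ASCII restricts \w to a-z, A-Z, 0-9, and underscore.
ascii_words = re.findall(r'\w+', "na\u00efve", flags=re.ASCII)
print(ascii_words)     # ['na', 've']

# VERBOSE ignores whitespace in the pattern and allows comments.
zip_verbose = re.compile(r"""
    \b          # word boundary before the code
    [0-9]{6}    # exactly six digits
    \b          # word boundary after the code
""", re.VERBOSE)
verbose_hits = zip_verbose.findall("Alipore, Kolkata, 700027")
print(verbose_hits)    # ['700027']
```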