Osahenru

Posted on Apr 26, 2024

Regex patterns in Python: A beginner's guide

#python #beginners #regex #webdev

A regular expression (regex) is a sequence of characters that defines a search pattern, usually used for text search, text manipulation, and input validation.

In this tutorial we are going to see how to validate, clean, and extract user data, using code examples to explain the concepts of regular expressions in Python.

At the end of this tutorial, I hope you'll be more confident in using regex in your next project or have a better understanding when you encounter it in a codebase.

^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

Insight💡: The regex pattern above is used to validate an email address. I know it looks really cryptic, but as we look at examples, you will begin to have a more in-depth understanding.
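
If you are curious how such a pattern is actually used, here is a minimal sketch. The names EMAIL_PATTERN and is_valid_email are made up for this illustration, and the sample addresses are assumptions, not real data.

```python
import re

# The pattern quoted above, split over two raw strings only to keep the lines short
EMAIL_PATTERN = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def is_valid_email(address):
    # Hypothetical helper: True when the whole address fits the pattern
    return EMAIL_PATTERN.search(address) is not None


for candidate in ["jane.doe@example.com", "jane@@example.com", "no-at-sign"]:
    print(candidate, is_valid_email(candidate))
```

Only the first sample passes: the second has two @ signs and the third has none.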

Let's start by updating our vocabulary with some common metacharacters used in regex.

.   any character except a new line
*   0 or more repetitions
+   1 or more repetitions
?   0 or 1 repetition
{m} m repetitions
{m,n} m-n repetitions
^   matches the start of the string
$   matches the end of the string or just before the newline at the end of the string
\ Used to drop the special meaning of character following it
| Means OR (A|B) A or B
(...)   a group
(?:...) non-capturing version
[]    set of characters
[^]   complementing the set
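
Before moving on, here is a small sketch that tries a few of these metacharacters with re.findall(), which returns every non-overlapping match as a list. The sample strings are made up for illustration.

```python
import re

text = "a ab abbb"

zero_or_more = re.findall(r"ab*", text)             # 'a' followed by zero or more 'b'
one_or_more = re.findall(r"ab+", text)              # 'a' followed by at least one 'b'
optional = re.findall(r"colou?r", "color colour")   # the 'u' may appear 0 or 1 times
bounded = re.findall(r"\d{2,3}", "1 22 333")        # runs of 2 to 3 digits
either = re.findall(r"cat|dog", "cat dog bird")     # alternation: cat OR dog
negated = re.findall(r"[^0-9 ]", "a1 b2")           # any character except a digit or a space
dot = re.findall(r"a.c", "abc a-c a\nc")            # '.' does not match the newline

print(zero_or_more)  # ['a', 'ab', 'abbb']
print(one_or_more)   # ['ab', 'abbb']
print(optional)      # ['color', 'colour']
print(bounded)       # ['22', '333']
print(either)        # ['cat', 'dog']
print(negated)       # ['a', 'b']
print(dot)           # ['abc', 'a-c']
```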

Now, let's write some code to better understand the regex syntax above, as we will be revisiting this code block frequently.

VALIDATING USERS DATA

Example 1: IP address validator:

Ideally, an IP address should consist of four sets of numbers separated by periods, with each set ranging from 0 to 255 and typically represented with at most three digits (e.g., 192.168.1.1). Each set is called an octet, represented in decimal form. That said, let's see how we can pattern a user's IPv4 address.

import re

def main():
    print(validate(input("IPv4 Address: ")))


def validate(ip):
    matches = re.search(r"^(\d{1,3}\.){3}\d{1,3}$", ip)

    if not matches:
        return False
    return True

if __name__ == "__main__":
    main()

Now, let's break down this pattern to understand what each part does:

^ -matches the beginning of the string
\d -a single decimal digit; a letter in this position makes the match fail, so validate() returns False
{1,3} -repeat the preceding item between one and three times, so each number can have one to three digits, i.e. 0 - 999
\. -a literal dot (.); without the backslash, re.search would interpret . as any character except a newline
(\d{1,3}\.) -a group: one to three digits followed by a dot
{3} -repeat the preceding item exactly 3 times; here that item is the group (\d{1,3}\.), so this part matches something like 192.168.1.
\d{1,3} -the final one to three digits, with no trailing dot
$ -matches the end of the string or just before the newline at the end of the string
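
To see the breakdown in action, here is a quick sketch that runs the same pattern over a handful of inputs. The helper name has_ipv4_shape and the sample strings are assumptions made for this illustration.

```python
import re

# Same pattern as in Example 1, compiled once for reuse
IPV4_SHAPE = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")


def has_ipv4_shape(text):
    # Hypothetical helper: True when the text looks like four dot-separated numbers
    return IPV4_SHAPE.search(text) is not None


samples = ["192.168.1.1", "10.0.0", "a.b.c.d", "1.2.3.4.5", "1.300.600.5"]
for sample in samples:
    print(sample, has_ipv4_shape(sample))
```

Only the first and last samples print True. The last one passes because the pattern checks the shape of the input, not the size of each number.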

Our code above checks if a user's IP address has four sets of octet numbers. However, it does not check if each octet number is between 0 and 255. Therefore, an IP address like 1.300.600.5 will return True instead of False. To fix this, we can include some extra logic. Modify the code as follows:

Example 2: IP address validator:

import re

def main():
    print(validate(input("IPv4 Address: ")))

def validate(ip):
    matches = re.search(r"^(\d{1,3}\.){3}\d{1,3}$", ip)

    if not matches:
        return False

    ip_list = list(map(int, matches.group(0).split(".")))

    # Every octet must be in range; returning from inside a loop
    # would only ever check the first octet
    return all(0 <= number <= 255 for number in ip_list)


if __name__ == "__main__":
    main()
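
The range check can also be moved into the pattern itself. What follows is only a sketch of one common way to do it, not the only one: the names OCTET, IPV4 and is_valid_ipv4 are made up here, and this version also rejects octets with leading zeros such as 01, which you may or may not want.

```python
import re

# One octet, 0-255: 250-255, 200-249, 100-199, or 0-99 without a leading zero
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Three "octet + dot" groups followed by a final octet
IPV4 = re.compile(rf"^(?:{OCTET}\.){{3}}{OCTET}$")


def is_valid_ipv4(text):
    # Hypothetical helper: True only when every octet is between 0 and 255
    return IPV4.search(text) is not None


for sample in ["192.168.1.1", "255.255.255.255", "1.300.600.5", "256.1.1.1"]:
    print(sample, is_valid_ipv4(sample))
```

The trade-off is readability: the pattern is harder to read than a plain numeric comparison, so pick whichever your team will find easier to maintain.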

Let’s update our regex vocabulary to see how we can improve our existing code base

\d    decimal digit
\D    not a decimal digit
\s    whitespace characters
\S    not a whitespace character
\w    word character: letters, digits, and the underscore
\W    not a word character
\b    word boundary: matches at the beginning or end of a word
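
Here is a short sketch of these shorthand classes at work. The sample strings are made up for illustration.

```python
import re

text = "cat_lover has 2 cats"

digits = re.findall(r"\d+", text)                      # runs of decimal digits
words = re.findall(r"\w+", text)                       # runs of letters, digits, and underscores
pieces = re.split(r"\s+", text)                        # split wherever there is whitespace
whole_word = re.findall(r"\bcat\b", "cat concat cats")  # 'cat' only as a standalone word
digits_only = re.sub(r"\D", "", "tel: 555-0199")       # delete everything that is not a digit

print(digits)       # ['2']
print(words)        # ['cat_lover', 'has', '2', 'cats']
print(pieces)       # ['cat_lover', 'has', '2', 'cats']
print(whole_word)   # ['cat']
print(digits_only)  # 5550199
```

Notice that \w+ keeps cat_lover together because the underscore counts as a word character, and that \bcat\b skips concat and cats because there is no word boundary on both sides of cat.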

Now, let's see how we might be able to use regex to extract data from a user's input

EXTRACTING DATA

Assuming we have a Facebook URL https://www.facebook.com/johndoe and we want to extract just the username, how might we go about this using our bank of rich regex vocabulary? We can think of various ways to implement this. One straightforward approach is to introduce the use of the re.sub() function.

The re.sub() function in Python is used for replacing substrings that match a specified pattern with a replacement string. It takes three main arguments: re.sub(pattern, replacement, string).

Example 3: Username extractor:

import re


text = 'https://www.facebook.com/johndoe'
# Escape the dots so they match a literal '.', and anchor the prefix to the start with ^
matches = re.sub(r'^https://www\.facebook\.com/', '', text)
print(matches)

Note: This code assumes that every Facebook URL begins with https://www., but you can take it up as a challenge to pattern URLs that begin with http://www. or just www.
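
If you want a head start on that challenge, here is one possible sketch using optional non-capturing groups and a capturing group. The names PROFILE_URL and extract_username are made up, and the assumption that a username consists only of word characters and dots is mine, not a rule taken from Facebook.

```python
import re

# Optional scheme (http:// or https://), optional "www.", then the username in group 1
# Assumption: usernames contain only word characters and dots
PROFILE_URL = re.compile(r"^(?:https?://)?(?:www\.)?facebook\.com/([\w.]+)")


def extract_username(url):
    # Hypothetical helper: return the username, or None if the URL does not fit
    match = PROFILE_URL.search(url)
    return match.group(1) if match else None


for url in [
    "https://www.facebook.com/johndoe",
    "http://www.facebook.com/johndoe",
    "www.facebook.com/johndoe",
    "https://example.com/johndoe",
]:
    print(url, extract_username(url))
```

The first three URLs give johndoe; the last one gives None because the domain does not match.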

I hope at this point it's all beginning to make sense. If not, let's cover some more examples.

CLEANING DATA

Sometimes, phone books append special characters and whitespaces to a phone number. How about we write a pattern to clean the data of a user’s phone number? Let's revisit our rich metacharacter vocabulary.

Assuming we want to get rid of the (+233) and replace it with a 0, how might we go about it? Well, let's take a look at the example below:

Example 4: Phone number book:

import re


phone_number = '(+233) 546 07890'
matches = re.sub(r'\(\+233\)', '0', phone_number)

print(matches)

Output: 0 546 07890

We can further improve this code by removing the whitespaces, again, by using the re.sub() function

Example 5: Phone number book

import re

phone_number = '(+233) 546 07890'
matches = re.sub(r'\(\+233\)', '0', phone_number)

rematched_pattern = re.sub(' ', '', matches)
print(rematched_pattern)

Output: 054607890
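
The two steps can be wrapped in one small function. This is only a sketch: the name clean_phone_number is made up, and it assumes the number always starts with the (+233) prefix and that every non-digit character left over (spaces, dashes, dots) is noise.

```python
import re


def clean_phone_number(raw):
    # Swap the "(+233)" country-code prefix for a leading 0
    local = re.sub(r"^\(\+233\)", "0", raw.strip())
    # Then drop every remaining non-digit character with \D
    return re.sub(r"\D", "", local)


print(clean_phone_number("(+233) 546 07890"))   # 054607890
print(clean_phone_number("(+233)-546-07890"))   # 054607890
```

Using \D instead of a literal space means dashes and other separators are cleaned up as well.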

MORE REGEX FUNCTIONS

The re.fullmatch() function in regex checks whether the entire string satisfies the pattern. It returns a match object if there is a match and None otherwise. This means that you have to pattern the entire string for the re.fullmatch() function to return a match object; otherwise, it returns None. This is unlike the re.search() function, which looks for the pattern anywhere in the string. Consider the example below:

Example 6 with fullmatch:

import re
text = '123abcd' 
matches = re.fullmatch(r'\d+', text)

print(matches)

When you run this program, you'll notice that the re.fullmatch() function returns None because we only tried to match the digits part of the string with r'\d+'. Now, let's write the same pattern, but using the re.search() function.

Example 7 with search():

import re

text = '123abcd' 
matches = re.search(r'\d+', text)

print(matches)

Output: <re.Match object; span=(0, 3), match='123'>

While the re.fullmatch() function returns None, the re.search() function returns a match object. For our IP address example above, re.fullmatch() would have been a simpler approach, since we are matching an entire string of data and would no longer need the ^ and $ anchors.
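
As a sketch of that idea, here is the IP address validator rewritten with re.fullmatch(). The function keeps the name validate from the earlier examples; the range check with all() is my own choice of phrasing.

```python
import re


def validate(ip):
    # fullmatch must consume the whole string, so no ^ or $ is needed
    if not re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", ip):
        return False
    # Every octet must be between 0 and 255
    return all(0 <= int(octet) <= 255 for octet in ip.split("."))


print(validate("192.168.1.1"))    # True
print(validate("1.300.600.5"))    # False
print(validate("192.168.1.1\n"))  # False
```

The last call shows a subtle difference: $ also matches just before a newline at the end of the string, while re.fullmatch() does not tolerate the trailing newline.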

re.fullmatch() vs. re.match()

The main difference is that re.fullmatch() succeeds only if the entire string (from beginning to end) matches the pattern, whereas re.match() only requires the pattern to match at the beginning of the string; anything may follow. Both return a match object if a match is found, otherwise None.

When to Use fullmatch(), match(), or search()

re.fullmatch(): Use when you want to ensure the entire string matches a pattern. For example, to validate a string against a specific format, such as checking if a string is a valid date or a sequence of digits.
re.match(): Use when you want to find a match only at the beginning of a string. For example, to validate user input, such as checking if a string starts with a specific prefix.
re.search(): Use when you want to find a match anywhere in a string. For example, to check if a string contains a specific pattern, such as a certain word or sequence of characters.

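In general, if you are unsure which function to use, re.search() is often a good choice because it is more flexible and can find a pattern anywhere in the string. On a surface level, one could argue that re.match() and re.fullmatch() are just shortcuts, as re.search() can do the work of both when the pattern is anchored: a leading ^ restricts the match to the start of the string (like re.match()), and adding a trailing $ requires it to run to the end (much like re.fullmatch(), except that $ also matches just before a final newline).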
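
To make the comparison concrete, here is a small sketch that runs the same pattern through all three functions. The helper name compare is made up for this illustration.

```python
import re


def compare(pattern, text):
    # Hypothetical helper: report which of the three functions find a match
    return {
        "match": re.match(pattern, text) is not None,
        "fullmatch": re.fullmatch(pattern, text) is not None,
        "search": re.search(pattern, text) is not None,
    }


for text in ["123", "123abc", "abc123"]:
    print(text, compare(r"\d+", text))

# Anchoring the pattern makes search behave like the stricter functions
print(re.search(r"^\d+", "abc123"))    # None, like re.match
print(re.search(r"^\d+$", "123abc"))   # None, like re.fullmatch
```

For 123 all three succeed; for 123abc only match and search succeed; for abc123 only search succeeds.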

COMMON FLAGS

To change how a regex function matches, you can pass flags. We will discuss three common flags used in regex.

re.IGNORECASE (re.I)

This flag ignores case sensitivity in a pattern, allowing uppercase and lowercase letters to be matched interchangeably.

Example 8: without the re.IGNORECASE flag:

import re

text = '123ABCDabcd'
matches = re.search(r'[A-Z]+', text)

print(matches)

The above code prints out a match object <re.Match object; span=(3, 7), match='ABCD'>, omitting lowercase characters. Let's consider another version of the same code where we use the re.IGNORECASE flag and observe the output.

Example 9: with the re.IGNORECASE flag:

import re

text = '123ABCDabcd'

matches = re.search(r'[A-Z]+', text, re.IGNORECASE)
print(matches)

Notice that with the re.IGNORECASE flag, we have a different printout <re.Match object; span=(3, 11), match='ABCDabcd'>, which includes all characters (due to case-insensitivity).

re.MULTILINE (re.M)

This flag changes where ^ (caret) and $ (dollar) can match. With it, ^ matches at the beginning of the string and at the beginning of each line (immediately after each newline, \n), and $ matches at the end of the string and at the end of each line (just before each newline, \n).

Example 10 without the re.MULTILINE flag:

import re

target_str = "Joy lucky number is 75\nTom lucky number is 25"

result = re.findall(r"^\w{3}", target_str)
print(result)

Output: ['Joy']

Without the flag, ^ matches only at the very start of the string, so the three word characters at the start of the second line (Tom) are not found. With a modified version like this...

Example 11 with the re.MULTILINE flag:

import re

target_str = "Joy lucky number is 75\nTom lucky number is 25"

result = re.findall(r"^\w{3}", target_str, re.MULTILINE)
print(result)

Output: ['Joy', 'Tom']

^ now also matches right after the newline, so we get the first three word characters of every line.
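
The flag affects $ in the same way. Here is a sketch that reuses target_str from the examples above to pull the number at the end of each line; the variable names last_only and every_line are made up.

```python
import re

target_str = "Joy lucky number is 75\nTom lucky number is 25"

# Without the flag, $ only matches at the end of the whole string
last_only = re.findall(r"\d+$", target_str)

# With the flag, $ also matches just before each newline
every_line = re.findall(r"\d+$", target_str, re.MULTILINE)

print(last_only)   # ['25']
print(every_line)  # ['75', '25']
```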

re.DOTALL (re.S)

By default, the . inside a regular expression pattern represents any character (a letter, digit, symbol, or punctuation mark), except the newline character (\n). However, this behavior can be altered using the re.DOTALL flag.

Example 12 without the re.DOTALL flag:

import re

target_str = """ML 
and AI"""
result = re.search(r".+", target_str)
print(result)

Output: <re.Match object; span=(0, 3), match='ML '>

Example 13 with the re.DOTALL flag:

import re

target_str = """ML 
and AI"""

result = re.search(r".+", target_str, re.DOTALL)
print(result)

Output: <re.Match object; span=(0, 10), match='ML \nand AI'>

You can see that the newline character is now included in the match.
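
Flags can also be combined with the | operator. Here is a small sketch that uses re.IGNORECASE and re.MULTILINE together; the log text is made up for this illustration.

```python
import re

log = "error: disk full\nWarning: low memory\nERROR: fan failure"

# No flags: ^ and $ only anchor to the whole string, so nothing matches
no_flags = re.findall(r"^error: (.+)$", log)

# re.MULTILINE alone: every line is checked, but 'ERROR' differs in case
multiline_only = re.findall(r"^error: (.+)$", log, re.MULTILINE)

# Both flags: every line is checked and case is ignored
both = re.findall(r"^error: (.+)$", log, re.IGNORECASE | re.MULTILINE)

print(no_flags)        # []
print(multiline_only)  # ['disk full']
print(both)            # ['disk full', 'fan failure']
```

Because the pattern contains one capturing group, re.findall() returns only the text captured by that group.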

CONCLUSION

The key to writing a good regex pattern is to visually model the pattern on the data you are given and tweak it as you progress, rather than relying solely on memory. Remember to test your pattern against a variety of inputs to ensure it behaves as expected. Please leave a comment below if you have any questions or need further clarification. Happy coding!
