In many circles of the coding realm, regex lurks as a mystery behind the scenes, cloaked in shadows until it shows its Lovecraftian face to us and we lose our digital minds to the madness that is the regular expression. I've endeavored to change that, if only a bit, so come with me as we take a first look at the real face of regex.
REGEX
Regular expressions, or regex, originated in 1951, when mathematician Stephen Cole Kleene described regular languages using a mathematical notation he called regular events.
One of their first appearances in a program was in Ken Thompson's version of the QED text editor in the mid-1960s, and since then they have been used in pattern-matching software all over the coding world. So what exactly is a regex, and how can we learn more about them?
A regular expression is a sequence of characters that defines a search pattern. It sounds simple... and that's because it IS simple! It's essentially a compact description of possible occurrences, written for the sole purpose of recognizing those occurrences whenever (or just the first time) they show up in a piece of text.
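To make "a sequence of characters that defines a search pattern" a bit more concrete, here is a minimal sketch in Python using the standard `re` module. The sample text and variable names are made up for illustration.

```python
import re

# A pattern is just a string; this one describes "one or more digits".
pattern = re.compile(r"[0-9]+")

text = "Order 66 shipped 3 boxes"

# search() finds the first place the pattern occurs (or returns None).
first = pattern.search(text)
print(first.group())    # 66

# findall() returns every non-overlapping occurrence.
all_numbers = pattern.findall(text)
print(all_numbers)      # ['66', '3']
```

The same pattern recognizes the occurrence once (`search`) or every time it arrives (`findall`); only the function you call changes.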
One thing to keep in mind: a regex is a matcher, and it's not the same in every place it's used. Different languages and tools support slightly different syntax, so a pattern written for one may need small changes in another.
But why
Regex is used in a wide variety of text-processing tasks, from serving as the language of lexers in parsing to data validation, data scraping, data wrangling, and more. It is heavily used because it can describe and search for patterns in very few characters. Let's take, for example, the regex we'll break down at the end of this post. What does it refer to? Well, by the end of this post, I'm hoping we can figure it out together. But first, let's look at the tools in the regex toolbox.
Tools
- quantifier- Regex quantifiers specify how many times the preceding character or group should be matched.
Here are some examples of quantifiers:
? – Zero or one
+ – One or more
* – Zero or more
{N} – Exactly N times (where N is a number)
{N,} – N or more times (where N is a number)
{N,M} – Between N and M times (where N and M are numbers and N < M)
*? – Zero or more, but as few as possible (a “lazy” match that stops at the first point where the rest of the pattern can match)
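Here is a small sketch of the quantifiers above in action, using Python's `re` module. The sample strings are invented for the demo.

```python
import re

# ?  zero or one: the "u" is optional
optional_u = re.findall(r"colou?r", "color colour colouur")
print(optional_u)       # ['color', 'colour']

# +  one or more
one_or_more = re.findall(r"a+", "caaandy bar")
print(one_or_more)      # ['aaa', 'a']

# *  zero or more
zero_or_more = re.findall(r"ab*", "a ab abbb")
print(zero_or_more)     # ['a', 'ab', 'abbb']

# {N} exactly N times, {N,M} between N and M times
exactly_three = re.findall(r"\b\d{3}\b", "7 42 901 1234")
two_to_four = re.findall(r"\b\d{2,4}\b", "7 42 901 1234 56789")
print(exactly_three)    # ['901']
print(two_to_four)      # ['42', '901', '1234']

# *  is greedy (takes as much as it can); *? is lazy (takes as little as it can)
greedy = re.findall(r"<.*>", "<b>bold</b>")
lazy = re.findall(r"<.*?>", "<b>bold</b>")
print(greedy)           # ['<b>bold</b>']
print(lazy)             # ['<b>', '</b>']
```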
- regex group- When searching with a regex, it can be helpful to treat several characters as one unit, or to pull out more than one matched item at a time. This is where “groups” come into play. Groups are defined by parentheses, and there are two different types: capture groups and non-capturing groups:
(...) – Capture group matching any three characters
(?:...) – Non-capturing group matching any three characters
The difference between these two typically comes up when “replace” is part of the equation: the text matched by a capture group is saved and can be reused in the replacement, while a non-capturing group only groups.
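As a rough illustration of where the two kinds of group differ, here is a sketch using Python's `re.sub` and `re.search`. The date and the "hahaha!" string are made-up inputs.

```python
import re

date = "2024-01-31"

# Capture groups are numbered from 1, left to right, and can be
# reused in the replacement as \1, \2, \3.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)
print(swapped)              # 31/01/2024

# A non-capturing group still groups (so + repeats "ha" as a unit),
# but it does not get a number. Only the (!) group is captured.
laugh = re.search(r"(?:ha)+(!)", "hahaha!")
print(laugh.group(0))       # hahaha!
print(laugh.groups())       # ('!',)
```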
- pattern collector- Pattern collections allow you to search for a collection of characters to match against. For example, using the following regex:
My favorite vowel is [aeiou]
You could match the following strings:
My favorite vowel is a
My favorite vowel is e
My favorite vowel is i
My favorite vowel is o
My favorite vowel is u
But nothing else.
Here’s a list of the most common pattern collections:
[A-Z] – Match any uppercase character from “A” to “Z”
[a-z] – Match any lowercase character from “a” to “z”
[0-9] – Match any digit
[opspk] – Match any character that’s either “o”, “p”, “s”, or “k” (listing “p” twice changes nothing)
[^asdf] – Match any character that’s not any of the following: “a”, “s”, “d”, or “f”
You can combine these, too:
[0-9A-Z] – Match any character that’s either a digit or a capital letter from “A” to “Z”
[^a-z] – Match any character that’s not a lowercase letter from “a” to “z”
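The vowel example and the combined collections can be tried out with a short Python sketch like the one below. The test strings are assumptions picked for the demo.

```python
import re

vowel_sentence = re.compile(r"My favorite vowel is [aeiou]")

# fullmatch() succeeds only if the whole string fits the pattern.
matches_e = vowel_sentence.fullmatch("My favorite vowel is e") is not None
matches_x = vowel_sentence.fullmatch("My favorite vowel is x") is not None
print(matches_e, matches_x)     # True False

sample = "ab-C 9"

# Digits or capital letters
digits_or_caps = re.findall(r"[0-9A-Z]", sample)
print(digits_or_caps)           # ['C', '9']

# Anything that is NOT a lowercase letter (note the space and the hyphen)
not_lowercase = re.findall(r"[^a-z]", sample)
print(not_lowercase)            # ['-', 'C', ' ', '9']
```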
- regex tokens- While keys like “a” to “z” make sense to match using regex, what about the newline character? That's what tokens are for:
. – Any character (except a newline, by default)
\n – Newline character
\t – Tab character
\s – Any whitespace character (including \t, \n and a few others)
\S – Any non-whitespace character
\w – Any word character (uppercase and lowercase Latin alphabet, numbers 0-9, and _)
\W – Any non-word character (the inverse of the \w token)
\b – Word boundary: the position between a \w character and a \W character; it matches in between characters rather than matching a character itself
\B – Non-word boundary: the inverse of \b
^ – The start of a line
$ – The end of a line
\\ – The literal character “\”
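A few of these tokens are easier to understand once you watch them work. Below is a small Python sketch; the sample text is invented, and raw strings (`r"..."`) are used so Python leaves the backslashes alone.

```python
import re

text = "cat concat\tcats\ncat."

# \b keeps "cat" from matching inside "concat" or "cats".
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)      # ['cat', 'cat']

# \s+ matches any run of whitespace: space, tab, or newline.
pieces = re.split(r"\s+", text)
print(pieces)           # ['cat', 'concat', 'cats', 'cat.']

# \w+ matches runs of letters, digits, and underscores.
words = re.findall(r"\w+", "snake_case, 42!")
print(words)            # ['snake_case', '42']

# . matches any character except a newline (by default).
dot_matches = re.findall(r"c.t", "cat cot c\nt")
print(dot_matches)      # ['cat', 'cot']

# \\ matches one literal backslash.
backslashes = re.findall(r"\\", r"C:\temp\new")
print(len(backslashes)) # 2
```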
- metacharacters
- Repeaters
'*'- The asterisk symbol tells the computer to match the preceding character (or set of characters) 0 or more times (up to forever). [1]
'+'- The plus symbol tells the computer to repeat the preceding character (or set of characters) one or more times (up to forever). [1]
'{}'- The curly braces tell the computer to repeat the preceding character (or set of characters) as many times as the value inside the braces. [1]
'^'- The caret symbol tells the computer that the match must start at the beginning of the string or line. (Example: ^\d{3} will match "901" in "901-333-".) [1]
'$'- The dollar symbol tells the computer that the match must occur at the end of the string, or before \n at the end of the line or string. [1]
'.'- The dot symbol can take the place of any other symbol, which is why it is called the wildcard character. [1]
'|'- The regex OR operator matches either the expression before it or the expression after it (cat|dog matches "cat" or "dog").
'?'- The optional character symbol tells the computer that the preceding character may or may not be present in the string to be matched. [1]
'()'- Grouping characters: a set of different symbols of a regular expression can be grouped together (within parentheses) to act as a single unit and behave as a block... [1]
'[]'- [set_of_characters] matches any single character in set_of_characters. By default, the match is case-sensitive. [1]
'\'- The escape symbol: if you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has a special meaning. [4]
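The anchors, the OR operator, and escaping can be sketched in Python as follows. The inputs are made up, and `re.escape` is a helper from Python's standard library that adds the backslashes for you.

```python
import re

# ^ anchors the match to the start of the string.
at_start = re.findall(r"^\d{3}", "901-333-")
print(at_start)                     # ['901']

# $ anchors the match to the end of the string.
at_end = re.search(r"\d+$", "room 101").group()
print(at_end)                       # 101

# | means "either the left side or the right side".
either = re.findall(r"cat|dog", "cat, dog, bird")
print(either)                       # ['cat', 'dog']

# Unescaped, + is a repeater: "1+1=2" means "one or more 1s, then 1=2".
unescaped_on_sum = re.search(r"1+1=2", "1+1=2")
unescaped_on_ones = re.search(r"1+1=2", "111=2")
print(unescaped_on_sum)             # None
print(unescaped_on_ones.group())    # 111=2

# Escaped, + is a plain plus sign.
escaped = re.search(r"1\+1=2", "1+1=2")
print(escaped.group())              # 1+1=2

# re.escape() builds a pattern that matches the given text literally.
auto_escaped = re.fullmatch(re.escape("1+1=2"), "1+1=2")
print(auto_escaped is not None)     # True
```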
Here are some other helpers:
\number
Backreference: allows a previously matched sub-expression (the text captured by a group enclosed in parentheses) to be matched again later in the same regular expression. \n, where n is a group number such as \1, means that the text captured by the n-th group must be repeated at the current position.
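A common use for a backreference is spotting a word that was accidentally typed twice. Here is a minimal Python sketch; the sentence is an invented example.

```python
import re

# (\w+) captures a word; \1 demands the very same text again.
doubled_word = re.compile(r"\b(\w+) \1\b")

sentence = "it was the the best of of times"

# findall() returns what group 1 captured for each match.
doubled = doubled_word.findall(sentence)
print(doubled)      # ['the', 'of']

# In the replacement, \1 puts the captured word back once.
cleaned = doubled_word.sub(r"\1", sentence)
print(cleaned)      # it was the best of times
```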
Comments- In any programming language, it's important to know how to leave notes without interfering with the code at large. Here's how to do it in regex:
(?# comment )
Inline comment: the comment ends at the first closing parenthesis.
# [to end of line]
X-mode comment: available when the x flag is on. The comment starts at an unescaped # and continues to the end of the line.
\d+[+\-x*]\d+\b # Here, we really match the + sign
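Both comment styles work in Python's `re` module, where x-mode is called `re.VERBOSE`. The sketch below is one way to write a commented pattern for simple arithmetic like "3+2"; the inputs are made up. Note that a hyphen sitting between two characters inside a set forms a range, so it is escaped here to keep it literal.

```python
import re

# Inline comment: ignored by the regex engine.
inline = re.compile(r"\d+(?# one or more digits)px")
print(inline.search("width: 120px").group())    # 120px

# X-mode: whitespace and # comments in the pattern are ignored.
arithmetic = re.compile(
    r"""
    \d+         # first number
    [+\-x*]     # operator: +, -, x or * (the hyphen is escaped)
    \d+         # second number
    \b          # stop at a word boundary
    """,
    re.VERBOSE,
)

found = arithmetic.findall("3+2 5-3 4x4 6*7 8/2")
print(found)    # ['3+2', '5-3', '4x4', '6*7']
```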
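- regex flag- A regex flag is a modifier to an existing regex. In languages that write a regex between forward slashes (such as JavaScript or Perl), these flags are appended after the last forward slash in the regex definition; other languages pass them in a different way.
Here’s a shortlist of some of the flags available to you. Not every flag exists in every flavor of regex.
g – Global, match as many times as the pattern occurs
m – Multiline, make ^ and $ match at the start and end of every line
i – Make the regex case insensitive
s – Let . match newlines as well
x – Allow spaces and comments inside the pattern
J – Allow duplicate group names
U – Ungreedy match: quantifiers match as little as possible by default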
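Python does not use the slash notation, so as a sketch of the same ideas, here is how the i, m, and s flags look when passed as arguments to `re` functions. Python has no g flag (you pick `search` or `findall` instead), and as far as I know its `re` module has no equivalent of J or U; be careful, because `re.U` in Python means Unicode, not ungreedy. The sample text is invented.

```python
import re

text = "Cat\ncat\nCAT"

# i: ignore case
any_case = re.findall(r"cat", text, re.IGNORECASE)
print(any_case)         # ['Cat', 'cat', 'CAT']

# Without m, ^ and $ only match at the very start and end of the text.
no_flag = re.findall(r"^cat$", text)
print(no_flag)          # []

# m: ^ and $ match at the start and end of every line.
multiline = re.findall(r"^cat$", text, re.MULTILINE)
print(multiline)        # ['cat']

# Flags combine with |
both = re.findall(r"^cat$", text, re.MULTILINE | re.IGNORECASE)
print(both)             # ['Cat', 'cat', 'CAT']

# s: lets . match a newline too.
without_dotall = re.search(r"Cat.cat", text)
with_dotall = re.search(r"Cat.cat", text, re.DOTALL)
print(without_dotall)               # None
print(with_dotall is not None)      # True

# "g" is a choice of function: search() stops at the first match.
first_only = re.search(r"cat", text, re.IGNORECASE).group()
print(first_only)       # Cat
```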
Breakdown:
So, what does this regex mean in the first place? Let's break it down piece by piece:
[\w._%+-] # \w refers to letters, numbers, and underscores; the set also allows ., %, + and -
+@ # match that set as many times as possible, then match the @ symbol directly
[\w.-] # the domain name; assuming the usual set of word characters, dots, and hyphens here
+\. # again, as many times as possible; this '.' is escaped, so match it exactly
[a-zA-Z] # then any letter
{2,4} # at least two times, but no more than 4.
Put them together and what do you get? It's a checker for an email address! (regu$ala9@gmail.com) See? Not so bad!
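If you'd like to try the checker yourself, here is a rough Python sketch. The assembled pattern is my assumption of how the pieces fit together (in particular the `[\w.-]` set for the domain name), and `looks_like_email` is a hypothetical helper name. Treat it as a toy, not as real-world email validation.

```python
import re

# Assumed assembly of the pieces from the breakdown; the [\w.-] part
# for the domain name is a guess.
EMAIL = re.compile(r"[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}")


def looks_like_email(candidate: str) -> bool:
    """Return True if the WHOLE string fits the pattern."""
    return EMAIL.fullmatch(candidate) is not None


print(looks_like_email("regular9@gmail.com"))   # True
print(looks_like_email("not-an-email"))         # False

# {2,4} caps the last part at four letters, so longer endings fail.
print(looks_like_email("name@example.museum"))  # False

# "$" is not in the first set, so the whole string is rejected...
print(looks_like_email("regu$ala9@gmail.com"))  # False

# ...but search() still finds an email-shaped piece inside it.
inside = EMAIL.search("regu$ala9@gmail.com").group()
print(inside)                                   # ala9@gmail.com
```

The last two calls show why the choice between `fullmatch` and `search` matters: the same pattern can check a whole string or hunt for a matching piece inside a longer one.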
Conclusion
Today we've peered into the bewildering abyss that is the regular expression and come out with our sanity. Jokes aside, the regular expression is an extremely useful and interesting tool for data matching, and parsers all over the world use it by default. Once the grammar and groove of these symbols are understood, there's a whole new world of pattern matching ahead of us. I hope it's a bit easier for you to take a few steps into that wilderness after today. Happy coding!
References
- https://www.geeksforgeeks.org/write-regular-expressions/
- https://coderpad.io/blog/development/the-complete-guide-to-regular-expressions-regex/#what-does-a-regex-look-like
- https://www.computerhope.com/jargon/r/regex.htm
- https://www.regular-expressions.info/tutorial.html
- https://en.wikipedia.org/wiki/Regular_expression