Regular expressions and what are they for
I was going through the Blender developer’s wiki for getting started and saw an unfamiliar concept in a list of recommended requirements to get started:
Learn basic regex (many search tools & IDE’s support them), it looks confusing but you can find some cheat sheets online to help.
Before this, I had never heard of regex before and the short Googling I did turned up some unclear results. Wkipedia says:
A regular expression, regex or regexp is a sequence of characters that define a search pattern.
My immediate questions followed: what’s it look like? Where is it used? How have I never seen it? And so, I guess the rest of this post is going to be my efforts at noting down what I’ve learned about this regular expression or regex.
So what is it?
As far as I’m concerned, the description from Wikipedia seems about as accurate as I would put it as well. It’s not exactly a programming language (although it seems very close), rather a sequence of characters that describes a set of strings and character combinations. An example would be a regular expression that defines a group of all the phone numbers. One could probably also say it’s very similar to set expressions in mathematics but that would be basically what it is.
Usually, a regular expression would have to be run through a regex interpreter, which are found in many text editors (apparently VSCode also has one but I haven’t used it yet). The regular expression can be run through the interpreter and can be used to find all the strings (i.e. the set of strings) matching the criteria defined by the expression.
Syntax
Characters in a regular expression will match themselves so an a
will match to any a
in the search field and so will 5
s and so forth. There are special characters (the Python documentation calls them metacharacters) however that serve other functions, which we will get to.
So in order to find a specific character we want, we just have to type in that character… seems simple enough. What about if we want to find any character from a broader group, especially if we’re not sure what character we want yet?
Character classes
In comes the first metacharacters, the [
and the ]
. Any set of characters enclosed within these are searched for individually, so:
[abc]
means searching for either characters a
, b
, or c
. These character sets can also be expressed as a set so alternatively, that above can also be written as
[a-c]
Complements to a set can be done with the caret ^
as the first character inside the class. This means any set of characters inside the class is not matched and the other ones are
[^5] matches with any character other than a "5"
[5^] matches with characters "5" or "^"
Outside of a class however, the caret ^
indicates that the matching set must be at the beginning of a string. Similarly, a dollar sign $
indicates that the matching set must be at the end of a string.
^I matches with an "I" at the beginning of a string
no$ matches with a "no" at the end of a string
^Hey dude$ matches an exact string "Hey dude"
To search for a string of characters, you can string (oh, so that’s why they’re called strings) together multiple characters or sets of the characters, as such:
[a-z][b-u][a-d][g-x][d-t] matches with words like "crate" and "beats", but not "peach" and "guava"
There are also pre-defined character classes, which represent a whole set of characters by itseld, such as the dot .
metacharacter, which will match any character. Here are some more:
\d matches with all digit characters (numbers)
\w matches with all alphanumeric characters (numbers and alphabets)
\s matches with whitespaces
\t matches with tabs
Negations can be represented with capitalized alternatives
\D matches with all non-digit characters (alphabets and symbols)
\W matches with symbols
It should be noted that inside character class metacharacters ([
and ]
), other metacharacters are not active inside the []
. If you want the same behavior outside of the class with a metacharacter, it has to be escaped using the backslash \
. These characters have to be escaped: ^.[$()|*+?
.
[$] matches a "$" but
\$ also matches a "$"
More metacharacters
Here are some more special characters and what they do
+ matches with one or more of the preceding character
pog+ matches with "pog", "pogg", or "poggggggg" but not "pooggg"
* matches with zero or more of the preceding character
? matches with zero or one of the preceding character
ab?c matches with "abc" or "ac" but not "abbc"
{num} used to specify a number of the preceding character appearing
p{2} matches with 2 p's
p{4,} matches with 4+ p's
p{3,7} matches with 3 to 7 p's
| acts as an OR operator
a|c searches for either "a" or "c", also equivalent to [ac]
Character captures
A character or a set of characters can be captured into a group, using parentheses (
)
. This also means the captured group can be accessed later for repetition as well, using the orders in which the groups were created, i.e. \1 \2 \3
and so on.
(jolly) matches "jolly"
(\d)kl\1 matches "4kl4" but not "gekle" or "4kl2", where the captured character is 4
(s|d) matches "s" or "d" and captures them
The captured group can also be named with ?<name> and disabled with ?:
(?<Bob>\d) matches digits and names it "Bob"
(?:sit) this group cannot be accessed with \number
Example use
Going back to the phone number, my regular expression for matching a phone number would look something like this:
/^((\d{3})|(\(\d{3}\)))\-?\d{3}\-?\d{4}/gm
It can be broken down into a couple of parts:
- the forward slashes on either end of the expression
/
encloses the expression -
g
andm
are called flags, which are special values for the regex interpreterg
(global) tells the interpreter to look for all the matches and not return after the firstm
(multiline) tells the interpreter to match after every newline after^
and$
(\d{3})|(\(\d{3}\))
means look for 3 digits either enclosed in parentheses or not\-?
means there may or may not be a dash between every 3rd digit\d{3}
and\d{4}
means match a 3 digit string and a 4 digit string respectively
As such, the above expression will match with numbers like:
- (212)664-7665
- 212-664-7665
- 212-664-7665
More information on regex can be found here:
Python 3.8.2 documentation - Python’s documentation on how to write regex and use them in scripts
Microsoft documentation - Information on using regex in Visual Studio and lists of the syntax used
Regex101 - try out some of your own regex here