Regular Expressions are a powerful way of expressing string patterns for matching. Value Search uses them to allow you to home in on particular strings in the database with precision. Regular Expressions are available in many languages and applications, and their syntax and capabilities share a common core.
We recommend regex101.com as an excellent resource to help understand how your regular expressions work – as well as running your regular expressions against text and showing you your matches, it produces an explanation of your expression.
The Simplest Match with Regular Expressions
If the characters in the regular expression appear exactly, and in the same order, in the text being searched, that’s a match:
Regular expression | Text | Match? |
---|---|---|
hello | hello | Yes, hello – We show where the expression matched with italics |
hell | shell | Yes, shell – a regular expression matches when the pattern is matched – it doesn’t have to match at the start of the text |
hello | hell | No – there are only four characters in the text and the regular expression wants five characters |
hello | helloisanyonein | Yes, helloisanyonein – A regular expression matches when the pattern is matched and here hello is matched immediately, the rest of the text is irrelevant |
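
The rows above can be reproduced in plain Java, the language whose regular expression engine Value Search uses. This is only a minimal sketch: the class name `SimplestMatch` and the helper `found` are made up for illustration, and it assumes that `Matcher.find()` (match anywhere in the text) is the behaviour that corresponds to the table, as opposed to `Matcher.matches()`, which requires the whole text to match.

```java
import java.util.regex.Pattern;

public class SimplestMatch {
    // Hypothetical helper: true when the pattern is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("hello", "hello"));           // true
        System.out.println(found("hell", "shell"));            // true, matches inside "shell"
        System.out.println(found("hello", "hell"));            // false, text is too short
        System.out.println(found("hello", "helloisanyonein")); // true, rest of text is ignored
    }
}
```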
Any Single Character – The dot .
A single . represents a single character, any character.
Regular expression | Text | Match? |
---|---|---|
. | a | Yes, a |
. | aa | Yes, aa – a regular expression matches when the pattern is matched |
.. | a | No – there’s only one character and the regular expression wants two characters |
.. | aa | Yes, aa – there are two characters in the pattern and two characters in the text |
You can use this . and other characters in combination:
Regular expression | Text | Match? |
---|---|---|
h.llo | hullo and hello | Yes, hullo and hello |
h.... | hodel | Yes, hodel – this is matching h followed by any 4 characters |
.hello. | hello | No, the expression wants any character before and after ‘hello’ |
say.hello | say hello | Yes, say hello – when we say any character, we mean it – a space is a character so the . matches it. |
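
To see which part of the text a pattern containing dots actually picks out, the matched text can be printed. The sketch below is illustrative only; `DotMatch` and `firstMatch` are hypothetical names, and the pattern with four dots is written out as `h....`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotMatch {
    // Hypothetical helper: the first matched text, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("..", "a"));                  // null, two characters are needed
        System.out.println(firstMatch("h.llo", "hullo and hello")); // hullo (the first of two matches)
        System.out.println(firstMatch("h....", "hodel"));           // hodel
        System.out.println(firstMatch(".hello.", "hello"));         // null
        System.out.println(firstMatch("say.hello", "say hello"));   // say hello
    }
}
```

One Java detail worth knowing: by default the dot does not match line terminators such as a newline unless the pattern is compiled with `Pattern.DOTALL`.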
The + and * Multiple Operators
So far, so simple; the patterns are one-to-one in terms of what they match. But regular expressions become far more expressive with the + and * operators. A + following a character indicates that the character should appear one or more times for a match. You can also combine the . with the + to match one or more of any character.
Regular expression | Text | Match? |
---|---|---|
a+ | manatee | Yes, manatee |
e+ | manatee | Yes, manatee |
d+ | waddle | Yes, waddle |
d+ | wondered | Yes, wondered |
m.+e | manatee | Yes, manatee |
m.+e | me | No. There are no characters between the m and the e. |
m.+t | manatee | Yes, manatee |
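
The + rows can be checked by collecting every match rather than just the first. A minimal sketch, with `PlusMatch` and `allMatches` as hypothetical names chosen for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlusMatch {
    // Hypothetical helper: every match of the pattern, in order of appearance.
    static List<String> allMatches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("a+", "manatee"));   // [a, a]
        System.out.println(allMatches("e+", "manatee"));   // [ee]
        System.out.println(allMatches("d+", "waddle"));    // [dd]
        System.out.println(allMatches("d+", "wondered"));  // [d, d]
        System.out.println(allMatches("m.+e", "manatee")); // [manatee]
        System.out.println(allMatches("m.+e", "me"));      // []
        System.out.println(allMatches("m.+t", "manatee")); // [manat]
    }
}
```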
A * says that the character should appear zero or more times. Matching with nothing is somewhat counter-intuitive. You can, for example, have a regular expression of a* and that will match any string at all, because every string has a position where zero a characters appear. That’s why it’s important to use a sequence of characters which terminates your pattern if you use *.
Regular expression | Text | Match? |
---|---|---|
a* | manatee | Yes (twice for the two a characters, and again as an empty match at each position that has no a – which is probably not what you want) |
a*tee | manatee | Yes – manatee (it doesn’t match the first a as that is followed by an n) |
d*l | waddle | Yes, waddle |
w.*d | wondered | Yes, wondered (regular expressions seek the biggest matching sequence) |
w.*d | wd | Yes, wd |
w.*d | would | Yes, would |
.*e | manatee | Yes, manatee (seeking the biggest matching sequence again). |
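
The empty matches that * can produce are easier to believe when counted. The following sketch (hypothetical names `StarMatch` and `allMatches`) collects every match and separates the empty ones from the rest; the exact number of empty matches can differ between engines, so only the non-empty count is annotated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StarMatch {
    // Hypothetical helper: every match of the pattern, empty matches included.
    static List<String> allMatches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> hits = allMatches("a*", "manatee");
        long nonEmpty = hits.stream().filter(s -> !s.isEmpty()).count();
        System.out.println(nonEmpty);                       // 2, one per a
        System.out.println(hits.size() - nonEmpty);         // the remaining matches are all empty

        System.out.println(allMatches("a*tee", "manatee")); // [atee]
        System.out.println(allMatches("d*l", "waddle"));    // [ddl]
        System.out.println(allMatches("w.*d", "wondered")); // [wondered]
        System.out.println(allMatches("w.*d", "wd"));       // [wd]
        System.out.println(allMatches(".*e", "manatee"));   // [manatee]
    }
}
```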
For completeness, there is also a ? operator, which matches zero or one occurrence of a character.
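
As a small illustration of ?, the sketch below uses a pattern that is not taken from the tables in this article: `colou?r`, where the u is optional. The names `OptionalMatch` and `allMatches` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalMatch {
    // Hypothetical helper: every match of the pattern, in order of appearance.
    static List<String> allMatches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The u may appear once or not at all.
        System.out.println(allMatches("colou?r", "color and colour")); // [color, colour]
        // Two u characters are one too many for ?.
        System.out.println(allMatches("colou?r", "colouur"));          // []
    }
}
```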
The Alternative Operator |
Where there may be two sequences of characters that you want to match against, the alternative operator | (the vertical bar, also known as the pipe character) will allow the regular expression to match the sequence on either side of it. Those sequences can also contain any other operators:
Regular expression | Text | Match? |
---|---|---|
a|e | manatee | Yes, manatee |
an|t | manatee | Yes, manatee – matches an and t, not an and at |
word|bird | The bird is the word | Yes – The bird is the word |
w..d|b..d | The bird is the word | Yes – The bird is the word |
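w.+d|b.+d | The bird is the word | Yes – The bird is the word – The search reaches the b in bird before it reaches any w, so the right-hand sequence matches first; its greedy .+ then runs on to the last d in the text, so the single match is bird is the word and word is never matched on its own. |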
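
Alternation is easiest to follow by listing what each pattern returns. A minimal sketch in Java, with `AlternationMatch` and `allMatches` as hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationMatch {
    // Hypothetical helper: every match of the pattern, in order of appearance.
    static List<String> allMatches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The bird is the word";
        System.out.println(allMatches("a|e", "manatee"));  // [a, a, e, e]
        System.out.println(allMatches("an|t", "manatee")); // [an, t]
        System.out.println(allMatches("word|bird", text)); // [bird, word]
        System.out.println(allMatches("w..d|b..d", text)); // [bird, word]
        // One long match: it starts at the b and the greedy .+ runs to the last d.
        System.out.println(allMatches("w.+d|b.+d", text)); // [bird is the word]
    }
}
```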
Character Classes with [ and ]
To match with one of a number of characters, you can create a character class. This is a set of characters surrounded by square brackets. So [ame] says “any character that is a, m or e” and would match with a or m or e.
If the first character in the class is ^ then this inverts the meaning, so [^ame] says “any character that is not a, m or e”. A - within the characters will create a range. So [a-z] will match any lowercase character from a to z and [0-9] would match a single digit. Multiple ranges can be specified too.
Because a character class effectively stands for a single character, you can use + and * with it too.
Regular expression | Text | Match? |
---|---|---|
[ame]+ | manatee | Yes (Three times, manatee ) |
[^ame]+ | manatee | Yes, (Twice, manatee ) |
[^ ]+ | This is a $50 manatee | Yes – This is a $50 manatee – this matches five times, for each sequence of characters which isn’t a space character. |
[a-z]+ | This is a $50 manatee | Yes – This is a $50 manatee – only the lowercase alphabetic characters match. |
[A-Za-z0-9]+ | This is a $50 manatee | Yes – This is a $50 manatee – everything matches but the spaces and the $ character. |
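
The match counts in the table can be confirmed by listing the matches for each character class. This is a sketch only; `CharClassMatch` and `allMatches` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassMatch {
    // Hypothetical helper: every match of the pattern, in order of appearance.
    static List<String> allMatches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is a $50 manatee";
        System.out.println(allMatches("[ame]+", "manatee"));  // [ma, a, ee]
        System.out.println(allMatches("[^ame]+", "manatee")); // [n, t]
        System.out.println(allMatches("[^ ]+", text));        // [This, is, a, $50, manatee]
        System.out.println(allMatches("[a-z]+", text));       // [his, is, a, manatee]
        System.out.println(allMatches("[A-Za-z0-9]+", text)); // [This, is, a, 50, manatee]
    }
}
```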
Anchor operators ^ and $
A regular expression may be required to match only at the start of a string or at the end of a string. For this, the anchor operators can be used. The ^ represents the start of a string and $ the end of a string.
Regular expression | Text | Match? |
---|---|---|
^man | manatee | Yes, manatee |
^tee | manatee | No, tee is not at the start of the text |
tee$ | manatee | Yes, manatee – the tee is at the end of the text |
n.*$ | manatee | Yes, manatee – the $ can terminate a multi-character sequence |
^[^t]+ | manatee | Yes – manatee – This is anchored at the start of the line, then followed by the character class [^t] where ^ means “not t”. This will match up to but not including the t in manatee. |
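
The anchored patterns can be tried out in the same way. In this sketch, `AnchorMatch` and `firstMatch` are hypothetical names, and the text is a single line, so the difference between “start of string” and “start of line” does not arise.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorMatch {
    // Hypothetical helper: the first matched text, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("^man", "manatee"));   // man
        System.out.println(firstMatch("^tee", "manatee"));   // null, tee is not at the start
        System.out.println(firstMatch("tee$", "manatee"));   // tee
        System.out.println(firstMatch("n.*$", "manatee"));   // natee
        System.out.println(firstMatch("^[^t]+", "manatee")); // mana, stops before the t
    }
}
```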
Escaping with Backslash \
If you want to match a character that is also one of the operator characters, you will need to tell the regular expression matcher not to treat it as an operator. This is done by preceding the character with a backslash in a process known as escaping.
Regular expression | Text | Match? |
---|---|---|
\$[^ ]+ | This is a $50 manatee | Yes – This is a $50 manatee – The backslash lets the expression match with the $ in the string. |
\. | The end of the line. | Yes – The end of the line. – the . at the end matches, because the backslash has made it explicitly a matchable full stop. |
\[words\] | arrayof[words] | Yes, arrayof[words] – the backslash here escapes the square brackets stopping them from defining a character class. |
\\ | the \ character | Yes, the \ character – here the backslash escapes itself so you can match with a backslash |
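
When these patterns are written inside Java source code rather than typed into a search box, there is an extra layer to be aware of: a backslash inside a Java string literal must itself be doubled. The sketch below shows the escaped patterns from the table as Java strings; `EscapeMatch` and `firstMatch` are hypothetical names, while `Pattern.quote` is a standard library method that escapes a whole literal for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeMatch {
    // Hypothetical helper: the first matched text, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The regex \$[^ ]+ is written "\\$[^ ]+" in Java source.
        System.out.println(firstMatch("\\$[^ ]+", "This is a $50 manatee")); // $50
        System.out.println(firstMatch("\\.", "The end of the line."));       // .
        System.out.println(firstMatch("\\[words\\]", "arrayof[words]"));     // [words]
        // The regex \\ (a literal backslash) needs four backslashes in Java source.
        System.out.println(firstMatch("\\\\", "the \\ character"));          // \

        // Pattern.quote treats every character of its argument literally.
        System.out.println(firstMatch(Pattern.quote("[words]"), "arrayof[words]")); // [words]
    }
}
```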
Special Characters with Backslash \
The backslash also has another function. Some ordinary characters, when preceded by a backslash, become powerful, time-saving shortcuts. Here’s a list of some of the common ones:
Backslash sequence | Meaning |
---|---|
\s | Any whitespace character, including tabs and returns |
\S | Any non-whitespace character |
\d | Any digit |
\D | Any non-digit |
\w | Any word character, equivalent to [A-Za-z0-9_] |
\W | Any non-word character |
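
A few of these sequences in action, as a minimal Java sketch. `ShorthandMatch` and `allMatches` are hypothetical names, and the text `snake_case 42` is an example invented here to show that, in Java's default mode, \w also matches the underscore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShorthandMatch {
    // Hypothetical helper: every match of the pattern, in order of appearance.
    static List<String> allMatches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is a $50 manatee";
        System.out.println(allMatches("\\d+", text));                    // [50]
        System.out.println(allMatches("\\S+", text));                    // [This, is, a, $50, manatee]
        System.out.println(allMatches("\\s", text).size());              // 4, one per space
        System.out.println(allMatches("\\w+", "snake_case 42"));         // [snake_case, 42]
        System.out.println(allMatches("[A-Za-z0-9]+", "snake_case 42")); // [snake, case, 42]
    }
}
```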
Do you want to know more about Regular Expressions?
The full and detailed reference to the Java regular expression engine and the regular expression patterns it accepts is in the Java 17 Documentation. This is the regular expression engine used in the Studio 3T Value Search.
MongoDB itself uses the de facto standard PCRE library for regular expressions.
For a fuller introduction to Regular Expressions, we recommend O’Reilly’s Introducing Regular Expressions. Beyond that, you can consider Mastering Regular Expressions for a deeper dive into the subject. And finally, as we previously mentioned, regex101.com is an excellent interactive resource to help understand how your regular expressions work.