How can I trim a string prefix in MongoDB Aggregation? #Studio3T_AMA

Q: How can I trim a string to remove a prefix from a string in MongoDB Aggregation? I can’t find an appropriate operator to do it?

Welcome to the latest Studio 3T Ask Manatees Anything – #Studio3T_AMA. Want to know how to use Studio 3T to make your MongoDB life better? Just tweet and use #Studio3T_AMA or email ama@studio3t.com.

A: This is an interesting aggregation problem, one that should be simple. You would have thought there would be an out of the box function for stripping prefixes off strings. Or even a simple trim string operator. Let’s look into this with an actual query.

We’ll use our Welsh pubs dataset for this exercise and we’re going to attempt to find out what the most popular pub name in Wales is. Let’s get the Aggregation Editor out and start building our aggregation. We want to count using a normalized version of the pub name, ignoring “The ” at the start of pubs like “The White Horse” and counting them as “White Horse”.

When it comes to a single operator for stripping prefixes off, as far as we know, there isn’t one. If you’ve looked through the documentation, you might think that $ltrim was the ideal operator to do this – left trim string – so you’d create a $set stage like:

$set: {
	name: { $ltrim("$name","The ") }
}

If we let this aggregation stage loose on that data we find that, yes, records with “The ” at the start of their name have lost their prefix.

But, and it is a big but, any record which had “T” “h” “e” or ” ” at the start of the name would have lost that character or characters.

“The Ten Floor House” would become “Ten Floor House” but “Ten Floor House” would become “n Floor House”. A subtle but terrible effect; the second parameter in $ltrim is a selection of characters that need to be removed, not a single string.

ReplaceOne as a trim string alternative?

There is a function that will replace parts of strings in MongoDB’s aggregation framework. The $replaceOne (and it’s cousin, $replaceAll) do the simple string operation of finding matches and replacing them with alternative text:

$set: {
      name: { 
        $replaceOne: { 
            input: "$name", 
            find: "The ", 
            replacement: "" } 
        }
}

This will only replace the string “The “, not the characters that make it up. But doing it this way has its own side-effects. If, and only if, the string “The ” occurs somewhere else in the string, that’ll get deleted too. And there are indeed plenty of pubs with ‘The’ in the middle, such as the “Hole In The Wall”, the “Pub on The Pond” and “The Pump on The Green”.

If we want to only trim the “The ” off the string in this aggregation, we have to make sure we’re only dealing with a situation where the “The ” is at the start of the string.

Bring in the Regular Expressions

To solve this, we need a combination of the $cond and $regexMatch expressions. We’ll use $regexMatch to work out if the string begins with “The ” and only then will we do a $replaceOne on it. If it doesn’t match, we’ll pass the string straight through. That decision logic can be handled with $cond and the revised stage looks like this:

$set: {
    name : {
        $cond : {
            if : {
                $regexMatch : {
                    input : "$name",
                    regex : /^The /
                }
            },
            then : {
                $replaceOne : {
                    input : "$name",
                    find : "The ",
                    replacement : ""
                }
            },
            else : "$name"
        }
    }
}

This is how you do an “if…. then…. Else….” inside MongoDB’s aggregation. $regexMatch only returns true if it finds a match. We hand it the name as input and look for the regular expression “^The “. The ^ anchors the match to the start of the input string; it can only match that text occurring at the very start and nowhere else.

If the if expression is true, the then expression is evaluated. And here we have put the $replaceOne from earlier so it does the prefix removal.

If the if expression is false, then the elseexpression is evaluated. That in this case is just the unmodified string from $name.

Now, we are able to be sure that we only do $replaceOne where the prefix is present (and only change one occurrence – don’t, whatever you do, change $replaceOne to $replaceAll).

Wrapping up and reporting

And now the name field is free of “The “. We can now move on to processing the results. Let’s see which is the most popular “normalized” name. We can group and count the results by the name using $sortByCount:

$sortByCount: {
   "$name"
}

And we get top popular pub name roots:

So, we’ve used regular expressions and conditions to find out where we need to do string manipulation and trim a string as part of a MongoDB aggregation. This isn’t the only way to achieve this result though, and it’s something we’ll be coming back to in a future AMA.