Using Aggregation and Regex in MongoDB to Reduce Data in a Collection

Posted on: 19/06/2018 (last updated: 04/08/2021) by Phil Factor

When you are importing data into a MongoDB database, a lot of work and frayed tempers can be saved by preparing the data properly before you attempt to create a MongoDB collection. In this article, I’ll demonstrate how some real data can be imported into MongoDB and then cleaned up using aggregation and regex, so that we can get better insights from it.

Extracting the data

Just so that we can use data that you can try yourself, we’ll use some rainfall records that are easily available on the internet. I chose them because they are continually updated with new data, so the need to repeatedly refresh the data in the database is more apparent. If you have no interest in rain in southeast England, just close your eyes, click your heels and pretend it is key financial data. We’re using it to demonstrate a point.

The data is for southeast England rainfall. It is in plain text.

https://www.metoffice.gov.uk/hadobs/hadukp/data/daily/HadSEEP_daily_qc.txt

Rainfall data for Southeast England

At this point, we have data in a form that might have some value in a spreadsheet, but it just doesn’t work for a database. It is actually a pivoted report. The first column is the year (from 1931 up to the present), the second column is the month, and each of the remaining 31 columns holds the reading for one day of the month. Because there aren’t 31 days in every month, the unused columns have -99.99 in them.
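
To give a feel for the layout, each data line has the schematic form

year  month  day1  day2  day3  ...  day31

all whitespace-separated, with a consistent column width.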

Transforming the data

If we just imported this lot and then tried to use it in its present structure, we’d soon hit problems. How, for example, would you find the lowest and highest rainfall values within a range of dates? What sort of indexing strategy could you use?

Actually, there are a number of ways we can deal with this.

Duck the challenge

I’ve added the completed collection here in case you don’t want to try transforming the data yourself. This will allow you to play along with the next section. You will just need to unzip the data, load the collections, and proceed to the ‘Downstream reporting’ section.

  • Click here for the zipped RainfallData

Export data to CSV

You can read this into Excel and export it in CSV format, for example. You could then use mongoimport or Studio 3T to import the data. Not only would you then have a lot of aggregation to do in order to optimise the data for querying, but you’d get bored doing this by hand to keep the data up-to-date. You really need a way of importing this data that is more easily automated.
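
If you did take that route, a typical mongoimport call would be something like this (the file name is illustrative):

mongoimport --db RainfallData --collection Staging --type csv --headerline --file rainfall.csv

But a manual export from Excel is exactly the kind of step that resists automation, which is why I prefer the approach below.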

Convert data to JSON

If you convert the text to JSON, using regex, there is a wider range of options. You can get it into MongoDB as a collection of documents, each representing a month, with each day’s reading held in an array. Here is an example of converting a month’s data into a simple list within the document.

Simplified list from the monthly data
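
As a sketch (the collection name, field names and values here are purely illustrative), that month-per-document shape would be roughly:

// one document per month, with the daily readings as an array
 db.MonthlyRainfall.insertOne({
    "Year" : 1931,
    "Month" : 1,
    "Rainfall" : [ 2.3, 0.0, 0.1, 5.6 /* ...one element per day, up to 31 */ ]
 });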

I don’t want to do this for a few reasons. Firstly, I might want to add other datasets and compare across them. It could be that I need to do an analysis by week, quarter or day of the week, none of which really fit well into the implied data model.

Also, clipping out the invalid readings marked with -99.99 would be awkward to deal with. I prefer the simplest way of storing the data because it isn’t a large dataset, and I like simplicity at the document level. A dataset must be versatile, rather than optimized for a particular query. The data gods frown when you do that, and somehow induce your manager to ask for a report that you never predicted and that can’t be produced with your finely-crafted complex document design. Nope. Give me simplicity.

Simplicity is key when preparing data

Using regex

I want to do this as automatically as possible, so I can repeat this whenever necessary. The manual process is:

  1. Read the data into a text editor
  2. Cut out the three explanatory lines at the top (can be done with regex)
  3. Run a regex replace on the file (I’ve used .NET dialect) to turn each line into 31 documents
  4. Add an open square bracket '[' to the beginning of the file and a close square bracket ']' to the end
  5. Import the file into a staging collection in MongoDB
  6. Delete every invalid record with a rainfall of -99.99mm
  7. Create the full MongoDB collection in an aggregation process that converts the string representation of dates into ISO binary dates and adds the location information

This ETL process is usually scripted so I can easily refresh the data from source, or I can add other sources of rainfall data later.

Here, I’ve included the regex that I used to convert the text to JSON. (I generally use EditPad Pro and RegexBuddy to develop regexes.) The regex looks intimidating, but I like my regexes to be simple and ugly rather than clever. These require only a mastery of copy-and-paste rather than an extra lobe to the forebrain. We give thanks for the consistent table-width of the original data source.

Find

^([\d]{4})\s{1,6}([\d]{1,2})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})\s{1,6}([\d.-]{1,6})

Replace with

{"Date":"${1}-${2}-1",
 "Rainfall": ${3}},
 {"Date":"${1}-${2}-2",
 "Rainfall": ${4}},
 {"Date":"${1}-${2}-3",
 "Rainfall": ${5}},
 {"Date":"${1}-${2}-4",
 "Rainfall": ${6}},
 {"Date":"${1}-${2}-5",
 "Rainfall": ${7}},
 {"Date":"${1}-${2}-6",
 "Rainfall": ${8}},
 {"Date":"${1}-${2}-7",
 "Rainfall": ${9}},
 {"Date":"${1}-${2}-8",
 "Rainfall": ${10}},
 {"Date":"${1}-${2}-9",
 "Rainfall": ${11}},
 {"Date":"${1}-${2}-10",
 "Rainfall": ${12}},
 {"Date":"${1}-${2}-11",
 "Rainfall": ${13}},
 {"Date":"${1}-${2}-12",
 "Rainfall": ${14}},
 {"Date":"${1}-${2}-13",
 "Rainfall": ${15}},
 {"Date":"${1}-${2}-14",
 "Rainfall": ${16}},
 {"Date":"${1}-${2}-15",
 "Rainfall": ${17}},
 {"Date":"${1}-${2}-16",
 "Rainfall": ${18}},
 {"Date":"${1}-${2}-17",
 "Rainfall": ${19}},
 {"Date":"${1}-${2}-18",
 "Rainfall": ${20}},
 {"Date":"${1}-${2}-19",
 "Rainfall": ${21}},
 {"Date":"${1}-${2}-20",
 "Rainfall": ${22}},
 {"Date":"${1}-${2}-21",
 "Rainfall": ${23}},
 {"Date":"${1}-${2}-22",
 "Rainfall": ${24}},
 {"Date":"${1}-${2}-23",
 "Rainfall": ${25}},
 {"Date":"${1}-${2}-24",
 "Rainfall": ${26}},
 {"Date":"${1}-${2}-25",
 "Rainfall": ${27}},
 {"Date":"${1}-${2}-26",
 "Rainfall": ${28}},
 {"Date":"${1}-${2}-27",
 "Rainfall": ${29}},
 {"Date":"${1}-${2}-28",
 "Rainfall": ${30}},
 {"Date":"${1}-${2}-29",
 "Rainfall": ${31}},
 {"Date":"${1}-${2}-30",
 "Rainfall": ${32}},
 {"Date":"${1}-${2}-31",
 "Rainfall": ${33}},
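
In practice I’d rather not paste this into an editor every time, so the transformation can be scripted. Here is a rough Node.js sketch (file names are illustrative, and it approximates the regex approach above rather than reproducing it exactly): it splits each data line on whitespace, emits one document per day, and writes the lot out as a JSON array ready for import. The -99.99 placeholders are kept, because we delete them in the staging collection later.

// transform_rainfall.js - a sketch, not the article's original tooling
const fs = require('fs');

const lines = fs.readFileSync('HadSEEP_daily_qc.txt', 'utf8').split('\n');
const docs = [];

for (const line of lines) {
  const fields = line.trim().split(/\s+/);
  // a data line is a 4-digit year, a month, then 31 daily readings
  if (fields.length < 33 || !/^\d{4}$/.test(fields[0])) continue; // skip the header lines
  const [year, month, ...readings] = fields;
  readings.forEach((value, day) => {
    docs.push({ Date: `${year}-${month}-${day + 1}`, Rainfall: parseFloat(value) });
  });
}

// write a JSON array that mongoimport or Studio 3T can read
fs.writeFileSync('rainfall.json', JSON.stringify(docs, null, 1));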

Loading the data

Within Studio 3T, I created a new database called ‘RainfallData’. Studio 3T makes it easy to import various file formats to MongoDB, including JSON.

The JSON was imported successfully into a ‘Staging’ collection within ‘RainfallData’. (I’d checked out a handful of documents with The JSON Validator to make doubly sure of the regex).
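
The import, too, can be scripted for later refreshes; assuming the JSON file is a single array (the regex output with the added square brackets), something like this should do it:

mongoimport --db RainfallData --collection Staging --jsonArray --file rainfall.json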

I then deleted all the invalid records.

use RainfallData;
 db.Staging.deleteMany(
    { //delete all invalid records with rainfall of -99.99
       "Rainfall" : -99.99
    }
 );

You need to do it this way because some of the dates from the regex will be invalid. If you try to import them using the ISODate() function on the date string, the import will fail. (Other DBMSs will shrug and convert, for example, '31 Feb 2018' to '3 Mar 2018'. It isn't pretty, but it keeps the show on the road.)
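
Incidentally, if you want to see how many of these placeholder readings are in the staging collection before (or after) the delete, a quick count will tell you:

use RainfallData;
 db.Staging.find({ "Rainfall" : -99.99 }).count();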

This sort of conversion is best done as part of data staging. The following routine converts the date string to a BSON date in ISO binary format and adds a location code to each document. You’ll need MongoDB 3.6 for this; it used to be very unpleasant to do before the $dateFromString operator arrived.

The staging process creates the DailyRainfall MongoDB collection if it doesn't exist, and inserts the processed data from 'Staging'.

use RainfallData;
 db.DailyRainfall.insert(db.Staging.aggregate(
 // insert into the (empty) DailyRainfall collection from Staging
    [
       {
          "$project" : {
             "Date" : {
                "$dateFromString" : {
                   "dateString" : "$Date"
                } //convert the date string to a BSON date
             },
             "_id" : 1.0, //show it
             "Rainfall" : 1.0, //show it
             "Location" : "SEE" //South East England
          } //add field so we can compare with other data Sets
       }
    ],
    {
       "allowDiskUse" : false
    }
 ).toArray());

Why the struggle to get a binary date? Because you can’t reliably compare dates stored as strings: 2018-4-4 sorts as if it were later than 2018-4-30, and relying on leading zeros is a kludge. Even worse, the aggregation needed for such things as rainfall by month, or by year and month, becomes distinctly horrible. Whenever an ambiguous datatype is imported from another source, it pays to convert it to one of the explicit BSON types, as detailed in the MongoDB documentation.
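
A quick experiment in the shell (the values are picked purely for illustration) shows the problem:

// string comparison is lexicographic, so the 4th of April 'beats' the 30th
"2018-4-4" > "2018-4-30"                              // true - wrong
// BSON dates compare chronologically
ISODate("2018-04-04") > ISODate("2018-04-30")         // false - right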

Now, we run some checks by deriving some simple reports. So when were the ten worst days of rain?

use RainfallData;
 db.DailyRainfall.aggregate(
 // produce the ten worst days of rain
    [
       { //sort the entire collection
          "$sort" : {
             "Rainfall" : NumberInt(-1) //high to low
          }
       },
       { //turn the date into a human-oriented format
          "$project" : {
             "TheDate" : {
                "$dateToString" : {
                   "format" : "%d/%m/%Y",
                   "date" : "$Date"
                }
             },
             "Rainfall" : 1.0, //show it
             "_id" : 0.0 //show it
          }
       },
       {
          "$limit" : 10.0 //just the top ten
       }
    ],
    {
       "allowDiskUse" : false
    }
 );

Hmm. Looking good. I didn’t like the way the ISO date was rendered, so I changed it to a string (also very easy to do in Studio 3T: just right-click on a field and edit the value type).

Changing the value or field type in Studio 3T (from ISO to string) is very straightforward

Beware of putting the $limit() stage before the $sort() stage! The sort then only sees the limited subset rather than the entire collection, so the answer is wrong. Be careful where you place the $sort stage in your aggregation pipeline.
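
As a sketch of the trap, using the collection we built above:

// wrong: only ten arbitrary documents ever reach the sort
db.DailyRainfall.aggregate([ { "$limit" : 10 }, { "$sort" : { "Rainfall" : -1 } } ]);
// right: sort the whole collection, then keep the top ten
db.DailyRainfall.aggregate([ { "$sort" : { "Rainfall" : -1 } }, { "$limit" : 10 } ]);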

I used the aggregate() method – which Studio 3T simplifies through its Aggregation Editor feature – rather than find() because there seems to be no easy way of converting ISO dates into a human-readable equivalent there. ($dateToString() is an ‘Unsupported projection option’ for find().)

So which were the top ten wettest months? (I’ve added a calculated field to show how many inches of rain that was.)

use RainfallData;
 db.DailyRainfall.aggregate(
 // The top ten wettest months
    [
       {
          "$group" : {
             "_id" : {
                "date_year" : {
                   "$year" : "$Date"
                },
                   "date_month" : {
                      "$month" : "$Date"
                }
             },
             "count" : {
                "$sum" : 1.0
             },
             "sum" : {
                "$sum" : "$Rainfall"
             }
          }
       },
       {
          "$sort" : {
             "sum" : -1.0
          }
       },
       {
          "$project" : {
             "Year" : "$_id.date_year",
             "Month" : "$_id.date_month",
             "inches" : {//convert millimetres to inches
                "$divide" : [
                   "$sum",
                   25.4
                ]
             },
             "sum" : 1.0, //show it
             "_id" : 0.0 //hide it
          }
       },
       {
          "$limit" : 10.0
       }
    ],
    {
       "allowDiskUse" : false
    }
 );
The aggregate of the top ten wettest months in Southeast England

There are a number of obvious aggregation queries that are very easy to write to reduce the data to what we want.

We can get the rainfall by year, for example. If we want to do it by month or day of the week (it rains more at the weekend in England), the queries are very similar.
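
A sketch of the by-year version might look like this (my own approximation, not the exact query behind the screenshot):

use RainfallData;
 db.DailyRainfall.aggregate(
 // total rainfall per year, wettest first
    [
       { "$group" : {
            "_id" : { "$year" : "$Date" },
            "TotalRainfall" : { "$sum" : "$Rainfall" }
       } },
       { "$sort" : { "TotalRainfall" : -1.0 } }
    ]
 );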

The aggregate of the wettest years in Southeast England

Downstream reporting

At some point, we may want to provide output to an application that can do reporting more slickly. We could export the entire dataset but most reporting applications will want a smaller aggregate summary – a reduction of the data that makes it easy for the user. We’ll therefore prepare an aggregation that we can export to Excel.

We want, for this, a breakdown by all the common ways of displaying the data.

Here's the aggregation used to reduce the data for reporting

We do this via an aggregation, of course.

use RainfallData;
 db.RainfallExport.insert(db.DailyRainfall.aggregate(
 //produce an aggregate designed for downstream reporting
    [
       {
          "$group" : {
             "_id" : {
                "date_year" : { //extract the year
                   "$year" : "$Date"
                },
                "date_month" : {//extract the month
                   "$month" : "$Date"
                },
                "date_dayofweek" : {//and the day of week
                   "$dayOfWeek" : "$Date"
                }
             },
             "count" : {
                "$sum" : 1.0
             },
             "sum" : {
                "$sum" : "$Rainfall"
          //do the sum for the Year, Month, ‘Day of week’ grouping
             }
          }
       },
       { //what to show and what not to show
             "$project" : {
                "Year" : "$_id.date_year",
                "Month" : "$_id.date_month",
                "Weekday" : "$_id.date_dayofweek",
                "count" : 1.0,
                "sum" : 1.0,
                "_id" : 0.0
          }
       },
       {
             "$sort" : {
                "Year" : 1.0,
                "Month" : 1.0,
                "Weekday" : 1.0
          }
       }
    ],
    {
       "allowDiskUse" : false
    }
 ).toArray());

This renders down 32736 documents to 7329. We can then export the results to Excel. First we use the Export Wizard in Studio 3T:

Now that the aggregation has reduced the data, we can prepare it for export

Then save as a tab-delimited CSV file, using the wizard.
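
If you would rather script this step as well, mongoexport can produce a comparable (comma-separated) file from the command line, using the field names we created above:

mongoexport --db RainfallData --collection RainfallExport --type=csv --fields Year,Month,Weekday,count,sum --out RainfallExport.csv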

Open a new Excel spreadsheet and you can then import the text file.

The exported collection can then be easily saved as CSV, then opened via Excel

You can then create a pivot table, produce sparklines of the rainfall variations, and apply conditional formatting to show dry months (red) and wet months (blue).

A clean, visual report as a result of data preparation and reduction

You can then use the data to plot the yearly rainfall over the past eighty-seven years, and include a moving average.

We can also easily generate trendlines and moving averages from the cleaned-up data set

Summary

Don’t be in too much of a hurry to get to the point of writing code. Time spent planning the cleanest and most versatile way of structuring a database is still work: it pays off in a database that is simple to optimise and that readily produces the summary information you want.

Here we chose data that is extracted from an external data source in the form of a human-oriented report, and we transformed it into a form that is easy for a database or application to consume. We loaded it into a staging collection in MongoDB and did the final formatting, checking and tidying-up there before creating the collection used for querying. We then showed how easily an aggregation can be created and exported to a reporting system.



About The Author

Phil Factor

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.
