Working with Arrays in the Aggregation Pipeline

MongoDB 301: Aggregation Working with Arrays in the Aggregation Pipeline

When building an aggregate statement, you might need to work with one or more array fields and their individual elements. To help with this process, MongoDB offers a variety of expression operators that you can use as you refine your statement’s aggregation pipeline. Some of these operators are specific to arrays, and some can be used with multiple field types, although they can still be helpful when working with arrays.

In this section, you’ll learn how to use several of these operators to build an aggregate statement based on the customers collection in the sales database. The statement’s aggregation pipeline will include four stages. The first stage will filter the collection’s documents based on the values (elements) in the interests array field. The second stage will deconstruct the array’s values so you can work with them individually. The third stage will group the array values and count the number of documents in each group. And the fourth stage will save the pipeline’s output to a new collection in a separate database.

By the end of this section, you will learn how to

Use expression operators to filter input documents
Unwind an array to create individual documents
Group array values and generate a document count for each group
Write pipeline results to a new collection

What you will need

Access to a MongoDB Atlas cluster
Access to the customers collection in the sales database

Applying operators to an array

MongoDB provides several aggregate operators that can be useful when working with arrays in your aggregate statements. In some cases, the operators are specific to array fields. In other cases, they can be used with multiple field types, although they can still benefit arrays, whether directly or indirectly. Three of these aggregate operators are $unwind, $sortByCount, and $merge.

$unwind

The $unwind operator lets you deconstruct the values in an array field. The operator outputs a document for each array element, replacing the array field with the individual array value. By default, the operator does not output a document if the array field value is null, missing, or contains an empty array, but you can override this behavior.

The following syntax shows the arguments (fields) available to the $unwind operator:

$unwind:
{
  path: field,
  includeArrayIndex: string,
  preserveNullAndEmptyArrays: Boolean
}

The operator takes three arguments, which you pass into the operator as a document with three fields:

The path argument (mandatory) specifies the field path for the target array in the input documents.
The includeArrayIndex argument (optional) is the name of a new field for the element’s array index.
The preserveNullAndEmptyArrays argument (optional) determines whether a document is included in the output when the target array is null, missing, or contains an empty array.

At its simplest, the $unwind operator requires only the path field argument and its value, as shown in the following example:

db.customers.aggregate([
  { $unwind: { path: "$interests" } }
])

In this statement, the $unwind operator deconstructs the values in the interests array field. The field name must be enclosed in quotes and preceded with a dollar sign. A dollar sign must precede the field name whenever it can be interpreted as an expression.

When you run the $unwind operator, MongoDB outputs a document for each element in the array, repeating this process for each input document. For example, if an input document includes the elements Technology and Software in the interests array, the $unwind operator will generate two output documents to replace the source document. The new documents will each include a string field named interests. The field’s value in the first document will be Technology, and the field’s value in the second document will be Software.

$sortByCount

After you deconstruct the array, you can use the $sortByCount operator to group the documents by the array’s elements and to calculate the number of documents in each group. The operator also sorts the output data in descending order, based on the document counts. You can think of the $sortByCount operator as a shortcut to creating a $group stage and a $sort stage.The syntax for the $sortByCount operator is fairly straightforward:

$sortByCount: "expression"

The operator’s expression can be any expression except a document literal. In fact, the expression can be as simple as a single document field, as shown in the following example:

db.customers.aggregate([
  { $unwind: { path: "$interests" } },
  { $sortByCount: "$interests" }
])

In this case, the $sortByCount operator will group the documents by the interests field values, after they’re deconstructed by the $unwind operator. The output will include two fields, _id and count, as shown in the following figure. Notice that the data is sorted in descending order, based on the count values.

$merge

Once you’ve set up your pipeline to return the data you need, you can add a final stage based on the $merge operator, which writes the pipeline’s results to a specified collection, either in the current database or a different database. The $merge stage must be the last one in the pipeline.

The following syntax shows the components that make up the $merge operator:

$merge:
{
  into: collection|{ db: database, coll: collection },
  on: identifier field [, ...],
  whenMatched: replace|keepExisting|merge|fail|pipeline,
  let: variable: expression [, ...],
  whenNotMatched: insert|discard|fail
}

The $merge operator takes five arguments, which you pass into the operator as a document with five fields:

The into argument (mandatory) specifies the target collection for the pipeline’s output. If it’s in the same database as the source collection, you need only specify the collection name. If the collection is in a different database, you must provide both the database and collection names, using the field names db and coll. MongoDB will create the collection if it doesn’t exist.
The on argument (optional) defines one or more fields to serve as a unique identifier for determining when a pipeline document matches an existing document in the target collection.
The whenMatched argument (optional) determines what action to take if a pipeline document matches an existing document in the target collection.
The let argument (optional) defines one or more variables that can be used by the whenMatched argument.
The whenNotMatched argument (optional) determines what action to take if a pipeline document does not match an existing document in the target collection.

To continue with the previous example, you can use the $merge operator to save the output from the $sortByCount stage to a collection in a different database, as shown in the following example:

db.customers.aggregate([
  { $unwind: { path: "$interests" } },
  { $sortByCount: "$interests" },
  { $merge: {
      into: { db: "marketing", coll: "clients" },
      whenMatched: "keepExisting" }
  }
])

The $merge operator in this example saves the pipeline output to the clients collection in the marketing database. The operator also includes the whenMatched option, along with the keepExisting value. If a document already exists in the target collection, MongoDB will retain that document rather than replace it with one from the pipeline output.

In addition to aggregate operators, you can use other types of operators when working with documents in the pipeline. For example, you can use the $or and the $and Boolean expression operators to set up specific conditions based on array values, as shown in the following example:

db.customers.aggregate([
  {
    $match: 
    { $and: [
      { interests: { $in: [ "Technology" ] } },
      { $or: [
        { package: "Standard" },
        { prior_support: true } ]
      } ]
    }
  }
])

The statement uses the $match aggregate operator to filter the documents in the customers collection. For a document to be returned, the interests array must include the term Technology and it must meet one of two other conditions: the package field value must be Standard or the prior_support field value must be true.

To implement this logic, the operator’s expression starts with the $and operator, which defines two conditions. Both conditions must evaluate to true for a document to be included in the stage’s output. The first condition uses the $in operator to verify whether the interests array includes the term Technology. If the term exists, the condition will evaluate to true. The $in operator is used specifically to determine whether an array contains a certain value.

The second $and condition starts with the $or operator, which defines its own two conditions. The first of these checks whether the package field equals Standard, and the second one checks whether the prior_support field equals true. If either of these conditions evaluate to true, the second $and condition evaluates to true. If both the $and conditions evaluate to true, the document will be included in the stage’s output.