Simplifying enterprise data lake complexities with MongoDB Data Federation

One of the biggest challenges modern enterprises face is managing multiple databases and datasets. While unifying data into a single system seems like a great idea at first, the reality is that different workloads need different database solutions. As a result, enterprise data ends up being spread across multiple databases, creating the challenge of querying disparate data sources. Some organizations address this by moving data between databases, but that introduces extra complexities and costs.

Cross-source queries without ETL

What if you could query your operational database and, at the same time, access data residing in an external data lake, without any extra data movement? Enter MongoDB Atlas Data Federation.

Data Federation allows you to query data from your Atlas cluster and connected data lakes at the same time, without importing a copy of the data from the data lake into MongoDB.

Before Data Federation, querying data across MongoDB and an external database would require an Extract, Transform, Load (ETL) process to bring the external data into MongoDB first, and then process the query all within MongoDB. With data federation, you skip this step and directly query external data stored in JSON, BSON, CSV, TSV, Avro, ORC, or Parquet formats.
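As a rough illustration of what such a cross-source query can look like, the sketch below builds an aggregation pipeline that joins a cluster-backed collection with a collection mapped onto data lake files. The collection names (`orders`, `customer_profiles`) and field names are hypothetical, and the sketch assumes both collections have already been mapped into the same federated database. It only constructs the pipeline and does not connect to anything.

```python
from typing import Any


def build_orders_with_profiles_pipeline(status: str) -> list[dict[str, Any]]:
    """Build a pipeline that joins orders with externally stored profiles.

    Assumption: "orders" is backed by an Atlas cluster and
    "customer_profiles" is backed by files in a data lake. Both names
    are hypothetical and chosen only for this example.
    """
    return [
        # Filter the operational data first to keep the join small.
        {"$match": {"status": status}},
        # Join against the collection that is backed by data lake files.
        {
            "$lookup": {
                "from": "customer_profiles",
                "localField": "customerId",
                "foreignField": "customerId",
                "as": "profile",
            }
        },
        # Return only the fields the caller needs.
        {"$project": {"_id": 0, "orderId": 1, "total": 1, "profile": 1}},
    ]


pipeline = build_orders_with_profiles_pipeline("open")
```

With a driver such as PyMongo, a pipeline like this would typically be passed to `aggregate()` on the `orders` collection of the federated database, using the federated database's own connection string.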

Data federation also allows MongoDB to save processed query results back into the data lake. This is typically done as the final stage of an MQL query and allows organizations to process data from multiple sources and store the output in the data lake for future use.
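A minimal sketch of that final stage follows. It appends an `$out` stage that targets an S3 bucket to an existing pipeline. The bucket name, region, file name prefix, and grouping fields are placeholders invented for this example, and the exact options available should be checked against the current Atlas documentation.

```python
from typing import Any


def with_s3_output(
    pipeline: list[dict[str, Any]],
    bucket: str,
    region: str,
    filename: str,
    file_format: str = "parquet",
) -> list[dict[str, Any]]:
    """Return a new pipeline that ends by writing its results to S3.

    Assumption: the bucket is already configured as a store that the
    federated database is allowed to write to.
    """
    out_stage = {
        "$out": {
            "s3": {
                "bucket": bucket,
                "region": region,
                "filename": filename,
                "format": {"name": file_format},
            }
        }
    }
    # $out must be the last stage, so it is appended to a copy.
    return [*pipeline, out_stage]


# Hypothetical aggregation: revenue per region, saved for later use.
revenue_by_region = [
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
]
export_pipeline = with_s3_output(
    revenue_by_region,
    bucket="example-analytics-bucket",  # placeholder bucket name
    region="us-east-1",                 # placeholder region
    filename="reports/revenue-by-region",
)
```

Building the stage in a helper keeps the original pipeline reusable for interactive queries, while the exported variant differs only in its last stage.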

Positive business impact

Querying data from your Atlas cluster and connected data lakes at the same time removes the need for costly and time-consuming data transfers. It lets you make faster, real-time decisions based on up-to-date data. Even better, you can use the same MongoDB Query Language (MQL) you already use in your clusters.

By removing the ETL overhead, you can reduce operational complexity, lower the cost of building and maintaining ETL pipelines, and improve efficiency. This means teams can focus on data analysis instead of data movement and the operational management that comes with it. To the end user who is querying the data, all the data appears as collections, whether it resides in MongoDB Atlas or in an external data lake. The data from the external sources appears as documents in those collections, and is therefore queryable using one uniform query language.

The ability to process data from multiple sources and store the output in the data lake for future use has a positive impact because it enables building a single source of truth without having to duplicate data, thereby increasing consistency while lowering storage costs.

Data federation at scale

Data lakes often host extremely large volumes of data. So, you may wonder if querying with data federation will load it all into memory or copy it temporarily. Thankfully, the answer is no.

Data federation uses compute nodes that split the runtime load of reading subsets of the data lake partitions, not unlike MongoDB’s sharding mechanisms. The difference is that storage remains the responsibility of the data lake itself: the processing nodes read and process the data without owning it, acting as viewers of sorts.

The way Data Federation scales is different and separate from how MongoDB Atlas clusters scale. Atlas scales cluster nodes up by letting you choose larger machines with more memory, CPU, and I/O capacity. It scales out by provisioning more shards, but each shard is attached to its own storage.

With MongoDB Data Federation, the storage is not owned by Atlas directly or permanently. Rather, the Atlas control plane routes queries to the configured federation link. When the federated data resides in a data lake such as an Amazon S3 bucket, Azure Blob Storage, or Google Cloud Storage, Data Federation does not read all of the data in. Instead, an elastic pool of workers (the “compute plane”) is activated, and queries are pushed to those workers for processing.

The result is that those workers perform partition-specific queries against the actual storage files, in the cloud region where the storage itself resides. The partitioning scheme is also exploited to take advantage of data locality.
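The fan-out pattern described here can be pictured with a small simulation. This is a conceptual sketch only, not how Atlas implements its compute plane: the in-memory `PARTITIONS` dictionary stands in for files in object storage, a thread pool stands in for the elastic workers, and all names and values are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for partition files held in object storage.
PARTITIONS = {
    "sales/2023/part-0": [
        {"region": "EU", "amount": 120},
        {"region": "US", "amount": 80},
    ],
    "sales/2024/part-0": [{"region": "EU", "amount": 200}],
    "sales/2024/part-1": [
        {"region": "US", "amount": 50},
        {"region": "EU", "amount": 30},
    ],
}


def scan_partition(path: str, region: str) -> int:
    """One worker's job: read a single partition and return a partial sum."""
    return sum(doc["amount"] for doc in PARTITIONS[path] if doc["region"] == region)


def federated_total(region: str, max_workers: int = 3) -> int:
    """Fan the query out to one task per partition, then merge the partials."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda path: scan_partition(path, region), PARTITIONS))
    # Only the small partial results are combined; the partitions themselves
    # are never copied into a central store.
    return sum(partials)


eu_total = federated_total("EU")
```

Each task touches exactly one partition and returns only a small partial result, which mirrors the idea that workers process data where it lives and discard everything but the answer.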

The query’s field specification is matched to the partition(s) in which the data may reside, reducing the number of partitions the workers need to fan out to and the volume of data they need to read.
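To make the idea of matching query fields to partitions concrete, here is a simplified sketch of partition pruning. It assumes a `key=value` directory layout and equality-only filters, both chosen purely for illustration. The actual partition mapping in Data Federation is defined in the federated database's storage configuration, and this code does not reproduce it.

```python
from typing import Any


def parse_partition(path: str) -> dict[str, str]:
    """Extract key=value attributes from a partition path."""
    attributes = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            attributes[key] = value
    return attributes


def prune_partitions(paths: list[str], query_filter: dict[str, Any]) -> list[str]:
    """Keep only the partitions that could contain matching documents.

    Filter fields that are not partition attributes cannot be used for
    pruning, so they are ignored here and left for the workers to apply.
    """
    selected = []
    for path in paths:
        attributes = parse_partition(path)
        if all(
            attributes[field] == str(value)
            for field, value in query_filter.items()
            if field in attributes
        ):
            selected.append(path)
    return selected


# Hypothetical partition layout and query filter.
paths = [
    "sales/year=2023/month=11/part-0.parquet",
    "sales/year=2023/month=12/part-0.parquet",
    "sales/year=2024/month=01/part-0.parquet",
]
to_read = prune_partitions(paths, {"year": 2024, "status": "shipped"})
```

In this example only the 2024 partition survives pruning, so the `status` condition needs to be evaluated against one file instead of three.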

This elastic arrangement allocates and manages worker nodes automatically behind the scenes. Administrators may configure query limits in order to control costs, but no provisioning or runtime infrastructure involvement is necessary. This approach reduces IT overhead, allowing teams to focus on business outcomes rather than managing query infrastructure.

Importantly, Data Federation does not persist the data that is processed during query execution within the compute plane. If a user decides to save query results into a collection, the data is stored in the target collection, but any intermediate data is discarded. The compute plane may retain only some metadata about the execution, not your document data.

Strategic value for the business

For businesses, the fact that storage remains the responsibility of the data lake, while the processing nodes work on the data without owning it, means optimized performance at scale without additional infrastructure investment.

The elastic arrangement described above can reduce management costs while providing automatic scalability. Further, getting quicker insights with lower latency can lead to better decision making.

With no compute-plane or elasticity to manage, and with the familiar MQL query language, data federation lets users seamlessly query vast amounts of data across the enterprise.

Tooling makes data federation more powerful

While data federation significantly simplifies the execution of the query, composing and managing complex queries can still be a challenge, especially for those who prefer a visual query building approach.

Data visualization tools offer a range of benefits for MongoDB users. Notably, a visual query builder tool lets users create complex queries through a drag-and-drop interface instead of manually writing MQL. This benefits newcomers unfamiliar with MQL syntax, but also helps experts as they explore and compose elaborate query pipelines.

Data federation gives users access to virtually unlimited data across MongoDB Atlas and external sources. Query optimization tooling can therefore help users refine and execute queries more efficiently, while also reducing errors and accelerating insights. Tools also help through schema visualization, which lets users understand federated data structures at a glance.

By streamlining and simplifying analysis and reducing the time analysts spend producing insights, tooling benefits organizations in cost, time savings, and accuracy.

Conclusion

Data federation exposes all data via a unified interface, making querying consistent across data sources. Organizations reap multiple immediate benefits from data federation: simplicity, shorter time-to-insight, uniformity of analysis, reduced infrastructure management, integration with first-class tooling, and higher overall productivity.