One of the great things that we love about MongoDB is of course that it’s schema-less, which makes adapting your application to changing requirements a breeze.
That said, your data will often have a fixed implicit schema, e.g. each document in your employees collection will likely always have a first and last name field. So, making sure from time to time that all documents indeed contain certain fields is probably a good idea. Likewise, having the power to dynamically add new or discontinue old fields in your documents can often lead to a proliferation of various “versions” of your document schema. So, getting a feel for how often a certain field occurs in your collection can be quite helpful.
Luckily, Studio 3T Pro makes discovering and exploring the schema in your MongoDB collections super-easy! With Studio 3T (formerly MongoChef), you can quickly:
- check the health of your schema
- find schema anomalies
- inspect data outliers
- visualize data distributions
So, let’s dive right in!
Select the collection whose schema you want to explore (in this example “customers”) and click the “Schema” icon in the global toolbar. This will open a tab where you can now configure your schema discovery:
- Here you can control how Studio 3T should sample documents from the collection for the schema discovery. By default, Studio 3T will analyze randomly selected documents. You can choose between “Random”, “First”, “Last”, or “All” – in which case Studio 3T will read in all documents in the collection.
- Studio 3T gives you full control over how many documents should be read for the schema discovery. For our example, we will look at 1,000 documents.
- By default, Studio 3T will not analyze the elements of any array fields when it encounters them. The reason is that arrays can often contain thousands of elements of the same type, which can lead to bloated schema results. You can of course override this default behavior.
- You can also provide a query to further control the document set that should be used for the schema discovery. In our example, we will use an empty query, which will return all documents in the collection.
- Click “Run analysis” to start.
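To make the sampling options above more concrete, here is a minimal Python sketch of what the four modes could mean for a list of documents. It is not how Studio 3T is implemented; the function name `select_sample` and its parameters are made up for illustration, and the documents are plain dictionaries held in memory.

```python
import random


def select_sample(documents, mode="Random", limit=1000, seed=None):
    """Pick documents for analysis (hypothetical helper, not a Studio 3T API)."""
    if mode == "All":
        # "All" ignores the limit and reads every document.
        return list(documents)
    if mode == "First":
        return list(documents[:limit])
    if mode == "Last":
        # Guard against limit == 0, because documents[-0:] is the whole list.
        return list(documents[-limit:]) if limit > 0 else []
    if mode == "Random":
        rng = random.Random(seed)
        return rng.sample(list(documents), min(limit, len(documents)))
    raise ValueError(f"unknown sampling mode: {mode}")


# 5,000 stand-in documents; we analyze 1,000 of them, as in the example.
customers = [{"_id": i} for i in range(5000)]
sample = select_sample(customers, mode="Random", limit=1000, seed=42)
print(len(sample))
```

On the server side, MongoDB's aggregation framework offers a `$sample` stage (for example `{"$sample": {"size": 1000}}`) for drawing a random subset, which is one way a comparable random sample could be obtained outside the tool.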
After the analysis has completed, you will see the schema discovery result page:
In the left-hand pane (1), you will see the discovered schema tree. For each field you see its name, its global probability (i.e. the percentage of analyzed documents that were found to contain that field), and its discovered field type(s). You can now easily explore your schema as you would a JSON document.
In the right-hand pane (2), Studio 3T displays information about the type and data distribution of the currently selected schema field.
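If you are curious how a "global probability" and a list of field types can be derived at all, the following Python sketch computes both for a handful of in-memory documents. It is only an illustration under our own assumptions: the helper names `walk` and `discover_schema` are invented, types are reported as Python type names rather than BSON types, and arrays are not descended into (mirroring the default described above).

```python
from collections import defaultdict


def walk(document, prefix=""):
    """Yield (dotted_path, type_name) for every field, descending into objects."""
    for key, child in document.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path, type(child).__name__
        if isinstance(child, dict):
            yield from walk(child, path)


def discover_schema(documents):
    """Return {path: {"probability": percent, "types": {type_name: count}}}."""
    seen = defaultdict(lambda: defaultdict(int))
    for doc in documents:
        for path, type_name in walk(doc):
            seen[path][type_name] += 1
    total = len(documents)
    return {
        path: {
            # Each path occurs at most once per document, so the type counts
            # add up to the number of documents containing the field.
            "probability": 100.0 * sum(types.values()) / total,
            "types": dict(types),
        }
        for path, types in seen.items()
    }


docs = [
    {"name": {"first": "Ann", "last": "Lee"}, "title": "Dr"},
    {"name": {"last": "Roy"}},
    {"name": {"first": "Bo", "last": "Kim"}, "title": "Mr"},
    {"name": {"first": "Cy", "last": "Orr"}},
]
schema = discover_schema(docs)
for path, info in sorted(schema.items()):
    print(path, info["probability"], info["types"])
```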
Verifying Your Schema
The schema tree is a great tool to understand and verify your schema. For our example, we assume a – fictitious – “customers” collection which contains the personal information of the customers of our – equally fictitious – shop. With the schema tree, we can now easily verify that required fields like “name” and “transactions” do indeed occur in 100% of our documents. We can also observe that the optional field “title” is apparently provided by 56.7% of our customers. The schema tree is also really helpful in discovering schema anomalies:
Discovering Missing Fields
Looking at our schema discovery results, we see that field “first” – which stores our customers’ first names – is missing in 0.4% of our documents. This may for example suggest that our web shop software is flawed. To learn more about the documents that are missing a “first” field, Studio 3T makes it super-easy to explore the actual data. If we right-click the “first” field, we can choose “Explore documents not containing selected field”. This will open a new query tab that shows all documents that do not contain the field “first”. It is important to note that this query honors the base query criteria of your analysis, but it brings up all matching documents in the collection, not just those in the (limited) sample set.
We could now inspect those documents which might reveal clues as to what has caused the missing field anomaly.
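For readers who want to see what such a "documents not containing the field" query might look like, here is a small Python sketch. The `$exists` operator is a standard MongoDB query operator; everything else (the helper names `has_path` and `missing_field_filter`, the sample documents, and the way the base query is combined via `$and`) is our own assumption, not something Studio 3T documents in this article.

```python
def has_path(doc, dotted_path):
    """Return True if a dotted path such as "name.first" exists in the document.

    Simplified: unlike MongoDB, this does not look inside arrays.
    """
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    return True


def missing_field_filter(field, base_query=None):
    """Build a MongoDB filter for documents lacking `field`, honoring a base query."""
    missing = {field: {"$exists": False}}
    if not base_query:
        return missing
    return {"$and": [base_query, missing]}


# Hypothetical sample data; one customer has no "first" field.
customers = [
    {"_id": 1, "first": "Ann", "last": "Lee"},
    {"_id": 2, "last": "Roy"},
    {"_id": 3, "first": "Bo", "last": "Kim"},
]
query = missing_field_filter("first")
missing_ids = [c["_id"] for c in customers if not has_path(c, "first")]
print(query)
print(missing_ids)
```

The resulting `query` dictionary is the kind of filter one could pass to a driver's `find` call; the local `has_path` check merely mimics it on in-memory data.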
Discovering Unexpected Fields
Looking at our schema discovery results again, we spot another anomaly – an unexpected field:
We see that in 95% of our documents, the field “user_name” is spelled with an underscore. Yet, in 5% of documents, it is spelled with a hyphen. This could for example be down to a typo somewhere in the source code. After fixing the typo in the code, Studio 3T then makes it very easy to also fix it in your collection. Right-click the incorrectly-spelled “user-name” field and select “Explore documents containing selected field”.
This will open a new query tab showing all documents containing the selected field (“user-name”). Here, right-click the field to rename it, and choose “All documents in collection” in the ensuing dialog to rename all occurrences in the collection:
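The same kind of rename can be expressed with MongoDB's `$rename` update operator. The sketch below builds the filter and update documents and simulates their effect on in-memory dictionaries; the helper `rename_field` is hypothetical and only handles top-level fields, and whether Studio 3T issues exactly this update is an assumption on our part.

```python
def rename_field(doc, old, new):
    """Return a copy of doc with the top-level key `old` renamed to `new`."""
    if old not in doc:
        return dict(doc)
    renamed = {k: v for k, v in doc.items() if k != old}
    # Like $rename, an existing `new` field is overwritten.
    renamed[new] = doc[old]
    return renamed


# Filter and update documents in MongoDB syntax.
rename_filter = {"user-name": {"$exists": True}}
rename_update = {"$rename": {"user-name": "user_name"}}

# Hypothetical sample data with one misspelled field name.
customers = [
    {"_id": 1, "user_name": "ann"},
    {"_id": 2, "user-name": "roy"},
]
fixed = [rename_field(c, "user-name", "user_name") for c in customers]
print(fixed)

# With a driver such as PyMongo, the equivalent call would be along the lines of:
# collection.update_many(rename_filter, rename_update)
```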
Discovering Incorrect Field Types
Another type of schema outlier that one commonly wants to look out for is incorrect field types. Studio 3T makes that really easy to spot too. Consider in our example the “address” field in our customer documents. We store the addresses as an embedded object of the following type:
However, when we look at the “address.street” field in the discovered schema tree, we can see that there appear to be some outliers where “address.street” is of type String:
When we select the String instance of “address.street” in the schema tree, we can quickly see in the right-hand data pane that in 4 documents something must have gone awry and all street information was stored in a simple string.
We can right-click the String instance node in the schema tree to “Explore documents with selected field of type String” to have a closer look at those documents:
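To illustrate the idea behind spotting type outliers, the following Python sketch counts the types found at a dotted path and builds a `$type` filter (a standard MongoDB query operator) for the string instances. The document shape used here, with “address.street” as an object holding a name and a number, is purely hypothetical, as are the helper names.

```python
from collections import Counter

_MISSING = object()  # Sentinel so that a stored None is not mistaken for "absent".


def get_path(doc, dotted_path):
    """Return the value at a dotted path, or _MISSING if the path does not exist."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def type_counts(documents, dotted_path):
    """Count how often each Python type occurs at the given path."""
    counts = Counter()
    for doc in documents:
        value = get_path(doc, dotted_path)
        if value is not _MISSING:
            counts[type(value).__name__] += 1
    return dict(counts)


# Hypothetical sample data; document 2 stores the street as a plain string.
customers = [
    {"_id": 1, "address": {"street": {"name": "Main St", "number": 12}}},
    {"_id": 2, "address": {"street": "34 Oak Ave"}},
    {"_id": 3, "address": {"street": {"name": "Elm Rd", "number": 7}}},
]
counts = type_counts(customers, "address.street")
string_filter = {"address.street": {"$type": "string"}}
string_ids = [
    c["_id"] for c in customers if isinstance(get_path(c, "address.street"), str)
]
print(counts, string_ids)
```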
Exploring Data Distributions
As we have seen, when you click a field or one of its type instances in the schema tree, you can see charts showing – depending on the selected data type – various statistics on the type or data distribution of the field. Now, while Studio 3T is not a full-blown BI tool by any means, these data distribution charts can often already give you some very useful insights into your documents.
For numeric fields, one of the stats charts that the right-hand pane shows is the value histogram. If we look for example at the value histogram of our “transactions” field, we can quickly observe that most customers have engaged in around 50 transactions, with tails of less common numbers of transactions.
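A value histogram of this kind boils down to grouping values into equal-width bins and counting them. Here is a minimal Python sketch; the bin width, the sample values, and the `histogram` helper are illustrative assumptions and say nothing about how Studio 3T chooses its bins.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map the lower edge of each bin to the number of values falling into it."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Hypothetical transaction counts for a few customers.
transactions = [12, 48, 51, 55, 47, 93, 50]
bins = histogram(transactions, bin_width=10)
for lower_edge, count in bins.items():
    print(f"{lower_edge}-{lower_edge + 9}: {'#' * count}")
```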
For many data types, the right-hand panel also shows the top values that were found across the analyzed documents. This can often be helpful to spot data outliers. Consider the “package” field in our example. Suppose our customers can subscribe to a “Free”, “Basic”, “Standard”, “XL”, or “XXL” package. However, we see that there are some customers who seem to have a “Beginner” package – which may indicate a backend glitch, for example.
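The "top values" idea can be sketched in a few lines of Python: count the values of a field and compare them against the set of values we expect. The package names come from the example above; the helper names and the sample data are made up for illustration.

```python
from collections import Counter

ALLOWED_PACKAGES = {"Free", "Basic", "Standard", "XL", "XXL"}


def top_values(documents, field, n=10):
    """Return the n most frequent values of a field as (value, count) pairs."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return counts.most_common(n)


def unexpected_values(documents, field, allowed):
    """Return the sorted values of a field that are not in the allowed set."""
    found = {doc[field] for doc in documents if field in doc}
    return sorted(found - set(allowed))


# Hypothetical sample data containing two stray "Beginner" packages.
packages = ["Free"] * 5 + ["Basic"] * 3 + ["Beginner"] * 2 + ["XL"]
customers = [{"_id": i, "package": p} for i, p in enumerate(packages)]

print(top_values(customers, "package", n=2))
print(unexpected_values(customers, "package", ALLOWED_PACKAGES))
```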
Luckily, we can now use Studio 3T to edit those values directly in-place:
For date fields, Studio 3T shows you in detail the value distributions. Consider, in our example, the “registered_on” field. If we look at its “Monthly value distribution”, we notice that customer registration seems to be particularly strong in the summer as well as in January. This might then provide valuable feedback to the marketing and sales teams.
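A monthly value distribution like this can be approximated by counting documents per calendar month. The sketch below assumes that “registered_on” is stored as a date and arrives in Python as a `datetime`; the helper name and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import datetime


def monthly_distribution(documents, field="registered_on"):
    """Count documents per calendar month (1-12), ignoring non-date values."""
    counts = Counter(
        doc[field].month
        for doc in documents
        if isinstance(doc.get(field), datetime)
    )
    # Include months without registrations so gaps are visible too.
    return {month: counts.get(month, 0) for month in range(1, 13)}


# Hypothetical registrations clustered in January and the summer.
customers = [
    {"_id": 1, "registered_on": datetime(2016, 1, 5)},
    {"_id": 2, "registered_on": datetime(2016, 1, 19)},
    {"_id": 3, "registered_on": datetime(2016, 7, 2)},
    {"_id": 4, "registered_on": datetime(2015, 7, 30)},
    {"_id": 5, "registered_on": datetime(2016, 8, 11)},
    {"_id": 6, "registered_on": "unknown"},
]
distribution = monthly_distribution(customers)
print(distribution)
```

Note that this groups by month across all years, which is what makes seasonal patterns such as a strong January stand out.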
Studio 3T provides a very powerful MongoDB schema explorer feature that lets you easily discover the schema that is present across the documents in your collection and thereby helps you find schema and data outliers. Drilling down into individual fields, you can see, for each field, various visual statistics relevant to the data type of that field.
MongoDB schema discovery and exploration has never been easier than with Studio 3T, and it’s just one of the many awesome features of our MongoDB client. Why not find out more about Aggregation Pipeline, SQL Querying, and IntelliShell? (We could go on and on 🙂)