Skip to content
Studio 3T - The professional GUI, IDE and client for MongoDB
  • Tools
    • Aggregation Editor
    • IntelliShell
    • Visual Query Builder
    • Export Wizard
    • Import Wizard
    • Query Code
    • SQL Query
    • Connect
    • Schema Explorer
    • Compare
    • SQL ⇔ MongoDB Migration
    • Data Masking
    • Task Scheduler
    • Reschema
    • More Tools and Features
  • Solutions
  • Resources
    • Knowledge Base
    • MongoDB Tutorials & Courses
    • Tool/Feature Documentation
    • Blog
    • Community
    • Testimonials
    • Case Studies
    • Whitepapers
    • Reports
  • Contact us
    • Contact
    • Sales Support
    • Feedback and Support
    • Career
    • About Us
  • Store
    • Buy Now
    • Preferred Resellers
    • Team Pricing
  • My License
  • Download
search

More Dataset And Schema Clean Ups With Studio 3T – #Studio3T_AMA

Posted on: 19/08/2021 (last updated: 07/09/2021) by DJ Walker-Morgan

Q: I’ve started cleaning up my dataset, but how do I make it more consistent?

Welcome to the latest Studio 3T Ask Manatees Anything – #Studio3T_AMA. Want to know how to use Studio 3T to make your MongoDB life better? Just tweet and use #Studio3T_AMA or email [email protected].

A: In the previous Ask Manatees Anything, we made a start cleaning up using the Studio 3T Schema Explorer on a simple dataset.

Tidying up schemas (and data) requires its own set of skills. Let’s dig deeper and look at ways to make schemas consistent. That means with the same fields throughout. And how to handle overlapping fields and functionality.

Making A Consistent Schema

At the end of the last article, our dataset is looking a lot more like the coherent document we are hoping to get.

Screenshot of schema analysis at end of part1/start part 2
We start here with many fields at 100%

Our schema looks like this:

Customer {
    _id: ObjectId,
    address: Object,
    device: String,
    dob: Date,
    interests: Array,
    package: String,
    prio_support: Bool,
    registered_on: Date,
    transactions: Int32
}

Let’s look at the fields which aren’t 100% again – email, first, last, title. This seems to be down to some variations in how names and email have been collected. Omitting fields is a legitimate way of handling a lack of a first name. But if you want a consistent schema (and storage isn’t severely limited), it is better to represent it as an empty string.

Filling In Missing Fields

If we go and select first in the schema view, then bring up the menu to Explore documents not containing selected field. We can see the documents lacking a first field. To remedy that, bring up the right button menu over the id of the first document:

Close up on the Add Field/Value menu
The Add Field/Value menu item being applied to a document.

The Add Field/Value dialog should appear. We want it to add the field “first” of type string and no value. Most importantly, we want to set it to Add this field to Documents matching query criteria.

Add Field/Value Dialog
The Add Field/Value Dialog allows you to add a field to multiple documents.

Clicking Add Field/Value will add that empty field. We’ll repeat the process with email, last and title, close the query window and rerun the analysis:

Schema analysis view after rerunning schema analysis
Now the Schema Analysis shows many more fields at 100%.

This is looking a lot better. Only user-name, user_name or username seem to be off kilter. Your first impulse might be to make them all the same name, say user_name.

Watch Out For Overlaps

But look closely and add the percentages: 6.9%+94.9% is 101.8%. That can only mean that there’s some documents with both. If we’d just gone and renamed user-name to user_name, we’d have lost the values in the user_name field.

The username field is less of a problem. If we use Explore documents containing selected field on the field in the Schema view, we can see it’s only in one document in the entire collection. Select that one document and, through the right-click menu, choose Field -> Rename Field. Rename the field to user_name.

Now, back to user-name and user_name. At this stage, in real life, you need to find out what is being stored in the two different fields. If there’s only a user-name field, you can make a reasonable assumption. This really should be the user_name and it’s just mis-named. To identify those fields, we will again use Explore documents containing selected field on the user-name field. Then we will bring up the Visual Query Builder and drag user_name from the view into the query. Now change the condition to doesn’t exist.

Using the Visual Query Builder to look for overlap
This setting will show us where only one of the two user names is set.

Click Run and our view updates to the 50 fields which don’t have both fields. We can now select the user-name field and rename it to user_name for everything that matches our query. We then select user-name in the tree, right-click for the menu and choose Field->Rename Field. In the dialog, we put user_name in as the new name and set it to rename for all Documents matching query criteria.

The Rename Field dialog
We can also rename multiple fields according to a query criteria.

Clicking Rename will bring up a warning about the possibility of overwriting existing data, but we’ve already handled that possibility by ensuring that user_name doesn’t exist in all these documents. When that’s done, we can return to the Schema view and re-run the analysis. We can now see user_name is 100%, but there’s 1.9% that still have user-name.

For that 1.9%, you may be facing a situation where there are in fact one or more user names for users which you might want to store in an array. Or as an alternate field (that isn’t confusingly named). If you switched to an array and have applications accessing the data, they’d need to change how they treat the user name values. There’s also the other alternative that the data in the user-name field in these cases is bogus and will need to be pruned.

Fixing The Overlap

Some decisions can’t be taken automatically, but Studio 3T can help you when you’ve decided. In this case, we’re going to make a user_name_alias field out of user-name, and have that new field on all documents.

The first step is to create the field itself, so we go to our collection’s tree view and select Field->Add Field and create a user_name_alias field for every document in the collection.

Now we can rename the user-name field into user_name_alias. Start by selecting user-name from the Schema view and double clicking into the collection’s tree view. There we can select the user-name field in any one of the documents and use Field->Field Rename to rename it to user_name_alias for all Documents matching query criteria. When it warns about overwriting data in the user_name_alias field, don’t worry as that’s exactly what we want to do. The value for the field is currently empty.

Now we return to the Schema view and rerun the analysis. Every field shows as 100% consistently typed and present. A shinier, cleaner dataset thanks to Studio 3T. Our final schema is:

Customer {
    _id: ObjectId,
    address: Object,
    device: String,
    dob: Date,
    email: String,
    first: String,
    interests: Array,
    last: String,
    package: String,
    prio_support: Bool,
    registered_on: Date,
    transactions: Int32,
    user_name: String,
    user_name_alias: String
}


How helpful was this article?
This article was hideous
This article was bad
This article was ok
This article was good
This article was great
Thank you for your feedback!

About The Author

DJ Walker-Morgan

Dj has been around since Commodore had Pets and Apples grew everywhere. With a background in Unix and development, he's been around the technology business writing code or writing content ever since.

Related articles

  • How do I clean up a dataset and schema with Studio 3T? – #Studio3T_AMA
  • What’s New in Studio 3T 2021.4 | Customizable Shortcuts, SQL Auto-completion and More
  • What’s New in Studio 3T 2021.6 | Integrating Hackolade, code folding, and more
  • What’s New in Studio 3T 2021.7 | Incremental migrations, customizable SQL types, and more
  • What’s New in Studio 3T 2021.8 | Custom Column Orders, new edition names, and more

Tags

academy aggregation AMA atlas community connections date tags events export features hackolade import intellishell JSON knowledge base migration modelling mongodb mongoodb productivity query regex releases schema security SQL Studio 3T tasks tips updates webinar windows

Browse by MongoDB topic

  • Connecting to MongoDB
  • Database Administration & Security
  • Getting Started with MongoDB
  • Getting Started with Studio 3T
  • Import/Export
  • Job Automation & Scheduling
  • MongoDB Aggregation Framework
  • MongoDB/Studio 3T Workshops
  • Performance
  • Query & CRUD Operations
  • Reference
  • Schema
  • Studio 3T Licensing
  • Support and other resources
  • Working with MongoDB & SQL
  • Working with MongoDB Atlas

Studio 3T

MongoDB Enterprise Certified Technology PartnerSince 2014, 3T has been helping thousands of MongoDB developers and administrators with their everyday jobs by providing the finest MongoDB tools on the market. We guarantee the best compatibility with current and legacy releases of MongoDB, continue to deliver new features with every new software release, and provide high quality support.

Find us on FacebookFind us on TwitterFind us on YouTubeFind us on LinkedIn

Education

  • Free MongoDB Tutorials
  • Connect to MongoDB
  • Connect to MongoDB Atlas
  • Import Data to MongoDB
  • Export MongoDB Data
  • Build Aggregation Queries
  • Query MongoDB with SQL
  • Migrate from SQL to MongoDB

Resources

  • Feedback and Support
  • Sales Support
  • Knowledge Base
  • FAQ
  • Reports
  • Case Studies
  • White Papers
  • Testimonials
  • Discounts

Company

  • About Us
  • Blog
  • Careers
  • Legal
  • Press
  • Privacy Policy
  • EULA

© 2022 3T Software Labs GmbH. All rights reserved.

  • Privacy Policy
  • Cookie settings
  • Impressum
When you click "Accept", you are agreeing to cookies being on your device. They may improve site navigation, site usage analysis, or the relevance of messages. It is up to you which cookies are enabled. Read our Privacy Policy.
Manage cookies
Accept
✕

Privacy Preference Center

A cookie is a small file of letters and numbers that is downloaded on to your computer when you visit a website. Cookies are used by many websites and can do a number of things, eg remembering your preferences, recording what you have put in your shopping basket, and counting the number of people looking at a website. In the form below you can choose which cookies, except for essential cookies, to allow or disable.

Accept all cookies

Manage consent preferences

Essential cookies are strictly necessary to provide an online service such as our website or a service on our website which you have requested. The website or service will not work without them.

Performance cookies allow us to collect information such as number of visits and sources of traffic. This information is used in aggregate form to help us understand how our websites are being used, allowing us to improve both our website’s performance and your experience.

Functional cookies collect information about your preferences and choices and make using the website a lot easier and more relevant. Without these cookies, some of the site functionality may not work as intended.

Social media cookies are cookies used to share user behaviour information with a third-party social media platform. They may consequently effect how social media sites present you with information in the future.

Accept selected