
Getting Data Masking and Anonymization Right

Posted on: 12/11/2019 (last updated: 04/08/2021) by Phil Factor

This is the first of two articles I’ve written on data masking in MongoDB. Learn how to apply specific masking methods in my second article, Working with MongoDB Data? Use These Data Masking Techniques.

There is, unfortunately, no easy way to turn the data from a production database into a safe, anonymized version from which all sensitive data has been removed and which is secure from re-identification. It is tricky to do properly.

Although data masking tools are necessary, they are not sufficient. In Europe, they can be used to implement a masking strategy only when the appropriate assessments have taken place and been properly documented in a defensible GDPR compliance report.

In the States, the industry standard is the equivalent HIPAA Expert Determination. Why? Surely, common sense tells you that if you mask out identifiers such as names and phone numbers, then the data is no use to anyone.

In fact, the reverse is true: data can be re-identified alarmingly easily if just a few clues are left in it. It takes an expert eye to spot these clues.

Masked person standing next to a screen. Photo by Max Bender on Unsplash.

Looking to mask collections in MongoDB? Try Data Masking for MongoDB for free. Apply a data masking technique depending on the field type and export masked documents to a new collection.

Why mask data?

Data masking, obfuscation, ‘pseudonymisation’, or ‘de-identification’ of data is required when certain data within a dataset must be kept private. This can be for many reasons, regulatory compliance and personal privacy chief among them.

The process of masking data serves two main purposes: one is to support testing, development and training work on a database system; the other is the publication of open data.

Restricted data reports

It is important for scientific research, particularly medical, genetic and epidemiological research, that data can be shared.

Open government also requires that all manner of data be available for inspection. Before such data is shared, the part of the data that would uniquely identify an individual or group of individuals is masked out, obfuscated or removed, on the assumption that the data can then no longer reveal personal information.

To further this aim, the preparation of healthcare open data sets in the USA is subject to the HIPAA Safe Harbor provision, which restricts disclosure of, for example, an individual’s full date of birth (only the year of birth may be reported) and places restrictions on the smallest reportable geographic unit. It also lists 18 different identifiers that must be removed from such shared data. The GDPR is less prescriptive, preferring an approach similar to HIPAA Expert Determination.
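As a concrete illustration, Safe Harbor-style generalisation can be scripted directly against each record. The sketch below is a minimal Python example, assuming a hypothetical patient record with date_of_birth and zip fields; it is not a complete Safe Harbor implementation, which covers all 18 identifier types.

```python
# A minimal sketch of Safe Harbor-style generalisation (hypothetical schema).
from datetime import date

def generalise_record(record: dict) -> dict:
    """Reduce quasi-identifiers to the granularity Safe Harbor allows."""
    masked = dict(record)
    # Only the year of birth may be reported.
    masked["date_of_birth"] = record["date_of_birth"].year
    # ZIP codes are cut to the first three digits; Safe Harbor also requires
    # full suppression for sparsely populated areas, a check omitted here.
    masked["zip"] = record["zip"][:3] + "00"
    return masked

print(generalise_record({"date_of_birth": date(1971, 6, 4), "zip": "02139"}))
# -> {'date_of_birth': 1971, 'zip': '02100'}
```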

Test datasets

Data that is required for development and testing is a special case, because the entire database, rather than an extract or subset, needs to be present, albeit with masking or obfuscation in place.

This is generally done within the database industry to track down a bug that is present only in the production data and which cannot be replicated with generated data.

Sensitive data, and data that can be used to identify individuals, must be masked or obfuscated. This use carries an extra danger beyond that of ‘restricted data’, because the whole database is exposed rather than a sample or an aggregate.

What can go wrong?

Even data properly de-identified under the Privacy Rule may carry with it some private information, and, therefore, poses some risk of re-identification, a risk that grows as new datasets are released and as datasets are combined.

Privacy researchers have, for several years, published research proving that individuals can be identified from masked data. The data industry has found this fact difficult to absorb, so security experts have had to use more spectacular means to make their point.

  • In 2002, the HMO breach combined two databases, the voter registration list for Cambridge, Massachusetts and the Massachusetts Group Insurance Commission (GIC) medical encounter data. A researcher was able to identify the medical records of the governor of Massachusetts at the time, William Weld.
  • In 2014, individual journeys, and the identities of the people who made them, were reverse-engineered from the anonymized public data set of hired ‘Boris Bikes’ in London.
  • In 2015, researchers examining three months of anonymized credit card records for 1.1 million people, with no obvious identifiers, found that just four spatiotemporal points were enough to uniquely re-identify 90% of individuals and to uncover all of their records.
  • In 2016, when the Australian Department of Health publicly released de-identified medical records for 10% of the population, privacy experts managed to re-identify them within six weeks.
  • In 2016, the publicly released New York City (NYC) taxi trips data set was found to allow the income of individual, identifiable taxi drivers to be calculated. Even without decoding the medallion numbers, 91% of the taxis operating in NYC could be identified from the data.
  • Open data published in 2016 containing public transport ride registrations in Riga yielded personal data about the journeys people were taking.
  • In 2017, journalists re-identified politicians in an anonymized browsing-history dataset of 3 million German citizens, uncovering their medical information and sexual preferences, including the porn preferences of a judge and the medicines used by a German MP.

These breaches were done by privacy researchers to prove the point. However, similar techniques are regularly used by journalists and private investigators.

For example, when a man was forcibly ejected from a United Airlines flight in 2017, he was filmed yelling that he was a doctor and that he was being profiled for being Chinese. This was enough: within hours, news reporters using publicly available information were at his house, requesting an interview.

In the UK, you have an 80% chance of being correctly identified solely from your postcode, date of birth and gender. This rises to 99% with seven attributes, and there are a growing number of viable techniques for carrying out such re-identification.

The organisations that provide open data argue that, where such data represents only a sample of the full data or is incomplete, de-anonymization is unlikely to expose much personal information with sufficient certainty to be confident of a match.

However, de-anonymization approaches are now so sophisticated that they can identify individuals and data with a high level of probability.

Data masking, by itself, is no longer a guarantee that personal information cannot be identified. Techniques such as the generative copula-based method of re-identification can correctly re-identify 99.98% of Americans in any dataset that contains 15 demographic attributes.

The GDPR is wary of the effectiveness of pseudonymization, and so requires that pseudonymized data be treated the same way as the original data, in terms of security and encryption, unless it has been subject to a risk assessment.

Data and strategies for compliance

It is important to find a compromise between the concerns of privacy and the benefits of exchanging and publishing data.

While there are very few, if any, cases of individuals who have been harmed by attacks with verified re-identifications, all of humanity has benefited from the use of de-identified health information.

De-identified health data is the workhorse that routinely supports numerous healthcare improvements and a wide variety of medical research activities: many of us owe our lives to the ongoing research and health system improvements that have been realized because of the analysis of de-identified data.

Likewise, the publication of open data by government and academic institutions has helped society greatly and enabled a range of expertise to be brought to bear on the way society is regulated.

To support this, many organisations, whether commercial, governmental, trusts or service providers, are required to be good custodians of the private data they hold on individuals. They are also often required to export the data outside the rigorous access-controlled security of the database.

To meet these conflicting requirements, several techniques are used. The objective is to ensure that the data remains useful while the risk level stays low.

Pseudonymised data has its direct identifiers removed but there is still a danger that it can be made to revert to an identifiable form and then acted upon in an inappropriate way.

Therefore, pseudonymised data is still classified as personal data and cannot be considered anonymous, so it remains subject to GDPR and HIPAA controls and restrictions.

Some of the more commonly used techniques are…

Restricting distribution and access

Although many public datasets can be used by anyone, many scientific datasets are highly restricted and use document-retention rules to delete the data after use.

Others are subject to Data Use Agreements (DUAs) or to direct Health and Human Services (HHS) mandates on conditions of use.

Encryption is the technology most generally used to implement file-level access control, and its use is recommended by the GDPR (Article 32).
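As a minimal sketch, a dataset exported for sharing can be encrypted at the file level before it leaves the secure environment. The example below uses Python with the third-party cryptography package (an assumption; any well-vetted symmetric scheme would serve) and a hypothetical export.json file.

```python
# File-level encryption of an exported dataset -- a minimal sketch.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # store in a key vault, never alongside the file
fernet = Fernet(key)

with open("export.json", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

with open("export.json.enc", "wb") as f:
    f.write(ciphertext)

# Only recipients holding the key can recover the data, via fernet.decrypt().
```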

Aggregation

Data is generally aggregated at the level required by the recipients of the data.

If, for example, epidemiologists need the incidence of influenza broken down by geographical area, the data can be aggregated at exactly that level, thereby avoiding individual data records.

No data other than the precise criteria should be included. If the researchers are clear about the hypothesis being tested, this should not worry them. If a ‘freedom of information’ request requires certain data, then just that data should be provided.
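In MongoDB, this kind of report is a natural fit for the aggregation pipeline. The sketch below uses pymongo against a hypothetical cases collection (one document per diagnosis, with region and diagnosis fields); only the counts per area ever leave the database.

```python
# Aggregated reporting: counts per region, no individual records exposed.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["epidemiology"]  # hypothetical

pipeline = [
    {"$match": {"diagnosis": "influenza"}},                # the precise criterion
    {"$group": {"_id": "$region", "cases": {"$sum": 1}}},  # one row per area
]
for row in db.cases.aggregate(pipeline):
    print(row["_id"], row["cases"])
```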

Sampling

In many cases, only a subset of the data is provided, thereby reducing the possibility of an exact identification.

The problem, of course, is the difficulty of creating a sample that is representative of the population as a whole.

Even heavily sampled, masked and anonymized datasets are unlikely to comply with the requirements of the GDPR, which considers that each and every person in a dataset must be protected for the dataset to be considered anonymous.
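MongoDB has a built-in way of drawing a uniform random subset: the $sample aggregation stage. A minimal pymongo sketch, assuming a hypothetical patients collection, with the caveats above still applying:

```python
# Random sampling with MongoDB's $sample stage -- a minimal sketch.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["health"]  # hypothetical

# Draw 1,000 documents uniformly at random; sampling reduces, but does not
# eliminate, the risk of exact identification.
for doc in db.patients.aggregate([{"$sample": {"size": 1000}}]):
    pass  # write each document to the extract destined for the recipients
```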

Masking

Masking is relevant only when a database needs to be delivered in a ‘de-identified’ form to enable the technology team within the organisation to perform certain testing, training and maintenance tasks.

Instead of leaving the sensitive information out entirely, it is obfuscated or replaced with generated data. This is more difficult than it may seem, because of the use of keys in a relational database.

If, for example, a person’s name is used as part of a primary key, then the change has to be propagated to all the referring foreign keys.

When a heavily-indexed column is obfuscated, any change in the statistical distribution of the key will change the way that the database performs.

Indexes often have distribution statistics attached to them and these statistics are used to determine the best execution strategy for the query using the index. If the statistics change, then the whole purpose of testing with the full dataset is lost.

Another problem with masking an identifiable characteristic is ensuring that it is done in all the places where it exists. It is pointless, for example, to mask a person’s name in the main record but forget that it is also in an XML field, or duplicated, de-normalised, within associated records.

Sometimes, it is startling to examine a database in its raw file format and see data that you thought was masked staring out from the screen from forgotten ‘note’ fields or email addresses. With a database that isn’t normalised, it can be difficult to get this simple task right.
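In a document database, the de-normalised copies are the usual trap. The sketch below is a minimal pymongo example, assuming a hypothetical customers collection in which the name appears both as a top-level field and as the author of embedded notes; both copies must be replaced consistently.

```python
# Masking a name everywhere it appears -- a minimal sketch (hypothetical schema).
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["crm"]  # hypothetical

for doc in db.customers.find({}, {"notes": 1}):
    masked = f"CUSTOMER-{doc['_id']}"   # deterministic replacement value
    updates = {"name": masked}
    if doc.get("notes"):
        # Don't forget the de-normalised copies inside the embedded array.
        updates["notes.$[].author"] = masked
    db.customers.update_one({"_id": doc["_id"]}, {"$set": updates})
```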

Need to obfuscate MongoDB collections? Data Masking for MongoDB, available in Studio 3T, provides field-level static data masking. Try it for free.

Shuffling and generation

If a column must be present for the database to be used, but its contents need not be accurate, then the data within that column can be shuffled across rows so that the distribution remains the same but the values can no longer be used to identify the individual entities in each row. Similarly, a column can be entirely generated according to specific rules.
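A minimal pymongo sketch of the shuffling approach, assuming a hypothetical employees collection with a salary field; the overall distribution is preserved, while the link between each value and its original owner is broken:

```python
# Shuffling one column across documents -- a minimal sketch (hypothetical schema).
import random
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["hr"]  # hypothetical

docs = list(db.employees.find({}, {"salary": 1}))
salaries = [d["salary"] for d in docs]
random.shuffle(salaries)                 # same values, new owners

for doc, salary in zip(docs, salaries):
    db.employees.update_one({"_id": doc["_id"]}, {"$set": {"salary": salary}})
```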

Conclusions

Data that has had the obvious methods of identification removed can be ‘re-identified’ by various techniques, ranging from the obvious to the complex. No technique can guarantee compliance with what society requires for privacy. They all have to be applied with care and intelligence, and with a broad knowledge of the re-identification techniques in use.

To be able to report to your organisation that the data in its care is secure, even in its masked, pseudonymized or de-identified form, the techniques you use for masking need to be underpinned by a written risk assessment. This assessment must be agreed by the member of the organisation responsible for the role of data protection officer (DPO) or equivalent, and documented in a defensible compliance report.

References

  • Methods for De-identification of PHI | HHS.gov
  • The ‘Re-Identification’ of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now, by Daniel Barth-Jones | SSRN
  • I Know Where You Were Last Summer: London’s public bike data is telling everyone where you’ve been
  • Unique in the shopping mall: On the reidentifiability of credit card metadata | Science
  • ‘Anonymous’ browsing data can be easily exposed, researchers reveal | Technology | The Guardian
  • Estimating the success of re-identifications in incomplete datasets using generative models | Nature Communications
  • Anonymizing NYC Taxi Data: Does It Matter? | IEEE Conference Publication
  • Privacy violations in Riga open data public transport system | IEEE Conference Publication
  • Pseudonymization and the Inference Attack | Simple Talk
  • The Case for De-Identifying Personal Health Information, by Khaled El Emam
  • Personal Data, Privacy, and the GDPR | Simple Talk
  • Privacy in Databases, by Mathy Vanhoef (2012) [PDF]
  • Computational disclosure control: a primer on data privacy protection. PhD thesis, Massachusetts Institute of Technology, 2001.
  • Robust De-anonymization of Large Sparse Datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, pages 111-125. IEEE Computer Society, 2008.
  • Composition Attacks and Auxiliary Information in Data Privacy [arXiv:0803.0032]. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 265-273. ACM Press, 2008.
  • Only You, Your Doctor, and Many Others May Know | Technology Science, by Latanya Sweeney


About The Author

Phil Factor

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

