There is, unfortunately, no easy way to turn the data from a production database into a safe, anonymized version from which all sensitive data has been removed and which is secure from re-identification. It is tricky to do properly.
Although data masking tools are necessary, they are not sufficient. In Europe, they can be used to implement a masking strategy only when the appropriate assessments have taken place and been properly documented in a defensible GDPR compliance report.
In the States, the industry standard is the equivalent HIPAA Expert Determination. Why? Surely, common sense tells you that if you mask out identifiers such as names and phone numbers, then the data is of no use to anyone wishing to identify the people in it.
In fact, the reverse is true: data can be re-identified alarmingly easily if just a few clues are left in the data. It takes an expert eye to spot these clues.
Why mask data?
Data masking, obfuscation, ‘pseudonymisation’, or ‘de-identification’ of data is required when certain data within a dataset must be kept private. This can be for many reasons, though privacy is the most common concern.
There are two main purposes for masking data: one is to support certain testing, development and training work on a database system, and the other is the publication of open data.
Restricted data reports
It is important for scientific research, particularly medical, genetic and epidemiological research, that data can be shared.
Open government also requires that all manner of data be available for inspection. Before such data is shared, the part of the data that would uniquely identify an individual or group of individuals is masked out, obfuscated or removed, on the assumption that the data will then no longer reveal personal information.
To further this aim, the preparation of healthcare open data sets in the USA is subject to the HIPAA Safe Harbor provision, which restricts disclosure of, for example, an individual’s full date of birth (only the year of birth may be reported) and sets the smallest geographic unit that may be reported. It also lists 18 identifiers that must be removed from such shared data. The GDPR is less prescriptive, preferring an approach similar to HIPAA Expert Determination.
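As a minimal sketch, the date-of-birth and geographic generalisations might look something like the following. The record layout and field names are purely illustrative, and the function covers only two of the 18 Safe Harbor identifiers.

```python
from datetime import date

def safe_harbor_generalise(record: dict) -> dict:
    """Apply two of the HIPAA Safe Harbor generalisations to a single record.
    The field names here are illustrative, not part of any standard schema."""
    out = dict(record)

    # Dates of birth may only be reported as the year of birth.
    dob: date = out.pop("date_of_birth")
    out["year_of_birth"] = dob.year

    # Geographic detail is restricted: keep only the first three digits of the
    # ZIP code (Safe Harbor further requires '000' for sparsely populated
    # three-digit areas, which is not modelled in this sketch).
    out["zip3"] = out.pop("zip_code")[:3]
    return out

print(safe_harbor_generalise({
    "date_of_birth": date(1957, 6, 14),
    "zip_code": "02139",
    "diagnosis": "influenza",
}))
# -> {'diagnosis': 'influenza', 'year_of_birth': 1957, 'zip3': '021'}
```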
Test datasets
Data that is required for development and testing is a special case, because the entire database, rather than an extract or subset, needs to be present, albeit with masking or obfuscation in place.
Such a copy is generally used within the database industry to track down a bug that is present only with the production data and cannot be replicated with generated data.
Sensitive data, and data that can be used to identify individuals, must be masked or obfuscated. A test database carries an extra danger beyond that of ‘restricted data’, because it contains the whole dataset rather than a prepared extract.
What can go wrong?
Even data that has been properly de-identified under the HIPAA Privacy Rule may still carry some private information and therefore poses some risk of re-identification, a risk that grows as new datasets are released and as datasets are combined.
For several years, privacy researchers have published work demonstrating that individuals can be identified from masked data. The data industry has found this fact difficult to absorb, so security experts have had to use more spectacular means to make their point.
- In 2002, a researcher combined two databases, the voter registration list for Cambridge, Massachusetts, and the Massachusetts Group Insurance Commission (GIC) medical encounter data, and was able to identify the medical records of the then governor of Massachusetts, William Weld.
- In 2014, individual journeys, and the identities of the people who made them, were reverse-engineered from the anonymized public dataset of hired ‘Boris Bikes’ in London.
- In 2015, researchers examining three months of anonymized credit card records for 1.1 million people, with no obvious identifiers beyond four spatiotemporal points per person, found that they could uniquely re-identify 90% of individuals and uncover all of their records.
- In 2017, journalists re-identified politicians in an anonymized browsing-history dataset of 3 million German citizens, uncovering their medical information and sexual preferences, including the porn preferences of a judge and the medicines used by a German MP.
- In 2016, when the Australian Department of Health publicly released de-identified medical records for 10% of the population, privacy experts managed to re-identify individuals within six weeks.
- In 2016, the publicly released New York City (NYC) taxi trips dataset was found to allow the income of individually identifiable taxi drivers to be calculated. Even without decoding the medallion numbers, 91% of the taxis operating in NYC could be identified from the data.
- In 2016, open data containing public-transport ride registrations in Riga was shown to yield personal data about the journeys people were taking.
These breaches were carried out by privacy researchers to prove the point. However, similar techniques are regularly used by journalists and private investigators.
For example, when a man was forcibly ejected from a United Airlines flight in 2017, he was filmed yelling that he was a doctor and that he was being profiled for being Chinese. This was enough: within hours, news reporters using only publicly available information were at his house, requesting an interview.
In the UK, you have an 80% chance of being correctly identified solely from your postcode, date of birth and gender; this rises to 99% with seven attributes. There is a growing number of viable techniques for doing this.
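The underlying problem is easy to demonstrate with a simple k-anonymity check: count how many rows share each combination of quasi-identifier values, and watch how quickly the groups shrink to one as attributes are added. This is a minimal sketch over made-up data.

```python
from collections import Counter

# Toy dataset: no names or phone numbers, only 'harmless' quasi-identifiers.
rows = [
    {"postcode": "CB1 2AB",  "dob": "1984-03-07", "gender": "F"},
    {"postcode": "CB1 2AB",  "dob": "1984-03-07", "gender": "M"},
    {"postcode": "SW1A 1AA", "dob": "1990-11-23", "gender": "F"},
    {"postcode": "SW1A 1AA", "dob": "1990-11-23", "gender": "F"},
]

def k_anonymity(rows, quasi_identifiers):
    """Smallest group size over all combinations of the quasi-identifier values.
    k == 1 means at least one person is unique, and so trivially re-identifiable."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

print(k_anonymity(rows, ["postcode"]))                   # 2
print(k_anonymity(rows, ["postcode", "dob", "gender"]))  # 1: a unique individual
```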
The organisations that provide open data argue that, where such data represents only a sample of the full data or is incomplete, de-anonymization is unlikely to expose much personal information with sufficient certainty to be confident of a match.
However, de-anonymization approaches are now so sophisticated that they can identify individuals and data with a high level of probability.
Data masking, by itself, is no longer a guarantee that personal information cannot be identified. Using a generative copula-based method of re-identification, researchers have estimated that 99.98% of Americans would be correctly re-identified in any dataset that contains 15 demographic attributes.
The GDPR is wary of the effectiveness of pseudonymization and so requires that pseudonymized data be treated in the same way as the original data, in terms of security and encryption, unless it has been subject to a risk assessment.
Data and strategies for compliance
It is important to find a compromise between the concerns of privacy and the benefits of exchanging and publishing data.
While there are very few, if any, cases of individuals who have been harmed by attacks with verified re-identifications, all of humanity has benefited from the use of de-identified health information.
De-identified health data is the workhorse that routinely supports numerous healthcare improvements and a wide variety of medical research activities: many of us owe our lives to the ongoing research and health system improvements that have been realized because of the analysis of de-identified data.
Likewise, the publication of open data by government and academic institutions has helped society greatly and enabled a range of expertise to be brought to bear on the way society is regulated.
To support this, many organisations, whether commercial, governmental, trusts or service providers, are required to be good custodians of the private data they hold on individuals. They are also often required to export that data outside the rigorous, access-controlled security of the database.
To meet these conflicting requirements, several techniques are used. The objective is to ensure that the data remains useful while the risk of re-identification stays low.
Pseudonymised data has its direct identifiers removed, but there is still a danger that it can be restored to an identifiable form and then used inappropriately.
Therefore, pseudonymised data is still classified as personal data; it cannot be considered anonymous and so is subject to GDPR and HIPAA controls and restrictions.
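As a minimal sketch of why this is so, consider pseudonymisation by keyed hashing (the key, token length and field below are arbitrary choices for illustration): the direct identifier is replaced by a stable token, but anyone who holds the key, or who can link the stable tokens against another dataset, can get back to the individual.

```python
import hashlib
import hmac

# The pseudonymisation key must be protected as carefully as the data itself:
# anyone holding it can link every token straight back to the original identifier.
SECRET_KEY = b"keep-this-out-of-the-masked-database"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a stable keyed token (HMAC-SHA256).
    The same input always yields the same token, so records stay linkable,
    which is precisely why the result is still personal data."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

print(pseudonymise("alice@example.com"))   # stable token for this identifier
print(pseudonymise("alice@example.com") == pseudonymise("alice@example.com"))  # True
```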
Some of the more commonly used techniques are described below.
Restricting distribution and access
Although many public datasets can be used by anyone, many scientific datasets are highly restricted and use ‘document-retention’ rules to delete the data after use.
Others are subject to Data Use Agreements (DUAs) or to direct Health and Human Services (HHS) mandates on the conditions of use.
Encryption is the most widely used technology for implementing file-level access control and is recommended by GDPR Article 32.
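As a minimal sketch, and assuming the third-party cryptography package and an illustrative filename, encrypting an extract before it leaves the database server might look like this; in practice the key would come from a key-management system rather than being generated next to the data it protects.

```python
# Requires the third-party 'cryptography' package: pip install cryptography
from cryptography.fernet import Fernet

# In production the key would come from a key-management service, not be
# generated inline alongside the data.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt an extract before it leaves the access-controlled database server.
with open("patient_extract.csv", "rb") as f:        # illustrative filename
    ciphertext = fernet.encrypt(f.read())

with open("patient_extract.csv.enc", "wb") as f:
    f.write(ciphertext)

# Only holders of the key can recover the plaintext.
plaintext = fernet.decrypt(ciphertext)
```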
Aggregation
Data is generally aggregated at the level required by the recipients of the data.
If, for example, epidemiologists need the incidence of influenza broken down by geographical area, the data can be aggregated to exactly that level, thereby avoiding the release of individual records.
No data other than that meeting the precise criteria should be included. If the researchers are clear about the hypothesis being tested, this should not worry them. If a ‘freedom of information’ request requires certain data, then just that data should be provided.
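A minimal sketch of this kind of release, using made-up row-level records and illustrative field names: only the requested aggregate leaves the database, never the individual encounters.

```python
from collections import Counter

# Row-level records (never released); the field names are illustrative.
encounters = [
    {"region": "East Midlands", "diagnosis": "influenza"},
    {"region": "East Midlands", "diagnosis": "influenza"},
    {"region": "London",        "diagnosis": "influenza"},
    {"region": "London",        "diagnosis": "asthma"},
]

# Release only the aggregate the researchers asked for:
# influenza counts per region, nothing else.
flu_by_region = Counter(e["region"] for e in encounters
                        if e["diagnosis"] == "influenza")

print(dict(flu_by_region))   # {'East Midlands': 2, 'London': 1}
```

A common additional safeguard, not shown here, is to suppress or round any cell whose count is small enough to point to an individual.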
Sampling
In many cases, only a subset of the data is provided, thereby reducing the possibility of an exact identification.
The problem, of course, is the difficulty of creating a sample that is representative of the population as a whole.
Even heavily sampled, masked and anonymized datasets are unlikely to comply with the requirements of the GDPR, which considers that each and every person in a dataset must be protected before the dataset can be considered anonymous.
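A minimal sketch of a simple random sample over a made-up population, with a quick check that the sample still roughly reflects the distribution of one attribute; none of this, of course, addresses the GDPR point above.

```python
import random
from collections import Counter

random.seed(1)   # fixed seed for a repeatable example only

# Toy population: each row carries one attribute whose distribution matters.
population = [
    {"id": i, "region": random.choice(["North", "South", "East", "West"])}
    for i in range(100_000)
]

# Release a 10% simple random sample instead of the full dataset.
sample = random.sample(population, k=len(population) // 10)

def shares(rows):
    """Proportion of rows in each region, rounded for display."""
    counts = Counter(r["region"] for r in rows)
    return {region: round(n / len(rows), 3) for region, n in sorted(counts.items())}

# A rough check that the sample is representative of the population it came from.
print("population:", shares(population))
print("sample:    ", shares(sample))
```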
Masking
Masking is relevant only when a database needs to be delivered in a ‘de-identified’ form to enable the technology team within the organisation to perform certain testing, training and maintenance tasks.
Instead of leaving the sensitive information out entirely, it is obfuscated or replaced with entirely generated data. This is more difficult than it may seem, because of the way keys are used in a relational database.
If, for example, a person’s name is used as part of a primary key, then the change has to be propagated to all the referencing foreign keys.
When a heavily indexed column is obfuscated, any change in the statistical distribution of the key will change the way that the database performs.
Indexes often have distribution statistics attached to them, and these statistics are used to determine the best execution plan for queries that use the index. If the statistics change, then the whole purpose of testing with the full dataset is lost.
Another problem with masking an identifiable characteristic is ensuring that it is masked in every place where it exists. It is pointless, for example, to mask a person’s name in the main record but forget that it also appears in an XML field, or sits denormalised within associated records.
Sometimes, it is startling to examine a database in its raw file format and see data that you thought was masked staring out from the screen in forgotten ‘note’ fields or email addresses. With a database that isn’t normalised, it can be difficult to get this simple task right.
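One widely used way of keeping every copy of a value consistent is deterministic substitution: derive the replacement from the original so that the same input always maps to the same fake value, wherever it appears. The sketch below is a simplified illustration; the name lists and hashing scheme are arbitrary, and a deterministic mapping like this is itself a form of pseudonymisation that belongs in the risk assessment.

```python
import hashlib

# Arbitrary pools of replacement values for the sketch.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey", "Morgan", "Riley"]
LAST_NAMES  = ["Smith", "Patel", "Garcia", "Nguyen", "Brown", "Kowalski"]

def mask_name(real_name: str) -> str:
    """Deterministically replace a real name with a plausible fake one.
    The same input always maps to the same output, so every table, XML
    fragment or denormalised copy that holds the name stays consistent,
    and joins on the masked value still work in the test copy."""
    digest = int(hashlib.sha256(real_name.encode("utf-8")).hexdigest(), 16)
    first = FIRST_NAMES[digest % len(FIRST_NAMES)]
    last = LAST_NAMES[(digest // 1000) % len(LAST_NAMES)]
    return f"{first} {last}"

print(mask_name("William Weld"))   # same replacement everywhere it is applied
print(mask_name("William Weld") == mask_name("William Weld"))  # True
```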
Shuffling and generation
If a column must be present for the database to be used, but its contents need not be accurate, then the data within that column can be shuffled between rows so that the overall distribution remains the same but the values can no longer be used to identify the individual entities in those rows. Alternatively, a column can be entirely generated according to specific rules.
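A minimal sketch of column shuffling over a made-up table: the set of salary values, and therefore the distribution that index statistics depend on, is unchanged, but no value remains attached to its real row.

```python
import random

random.seed(7)   # fixed seed for a repeatable example only

# Toy table: the salary column is sensitive, but its distribution matters
# for realistic index statistics and query plans.
rows = [
    {"employee_id": 1, "salary": 32_000},
    {"employee_id": 2, "salary": 48_500},
    {"employee_id": 3, "salary": 91_250},
    {"employee_id": 4, "salary": 48_500},
]

# Shuffle the sensitive column across rows: same values, same distribution,
# but no salary is attached to its real employee any more.
salaries = [r["salary"] for r in rows]
random.shuffle(salaries)
for row, masked_salary in zip(rows, salaries):
    row["salary"] = masked_salary

print(rows)
```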
Conclusions
Data that has had the obvious means of identification removed can be ‘re-identified’ by various techniques, ranging from the obvious to the complex. No technique can, by itself, guarantee compliance with what society requires for privacy; they have to be applied with care and intelligence, and with a broad knowledge of the re-identification techniques in circulation.
To be able to report to your organisation that the data in its care is secure, even in its masked, pseudonymized or de-identified form, the techniques you use for masking need to be underpinned by a written risk assessment. This assessment must be agreed by the member of the organisation responsible for the role of data protection officer (DPO), or equivalent, and recorded in a defensible compliance report.
- Methods for De-identification of PHI | HHS.gov
- The ‘Re-Identification’ of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now, by Daniel Barth-Jones | SSRN
- I Know Where You Were Last Summer: London’s public bike data is telling everyone where you’ve been
- Unique in the shopping mall: On the reidentifiability of credit card metadata | Science
- ‘Anonymous’ browsing data can be easily exposed, researchers reveal | Technology | The Guardian
- Estimating the success of re-identifications in incomplete datasets using generative models | Nature Communications
- Anonymizing NYC Taxi Data: Does It Matter? | IEEE Conference Publication
- Privacy violations in Riga open data public transport system | IEEE Conference Publication
- Pseudonymization and the Inference Attack | Simple Talk
- The Case for De-Identifying Personal Health Information | Khaled El Emam
- Personal Data, Privacy, and the GDPR | Simple Talk
- Privacy in Databases, by Mathy Vanhoef (2012) [PDF]
- Computational disclosure control: a primer on data privacy protection. PhD thesis, Massachusetts Institute of Technology, 2001.
- Robust De-anonymization of Large Sparse Datasets | IEEE Conference Publication. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, pages 111-125. IEEE Computer Society, 2008.
- Composition Attacks and Auxiliary Information in Data Privacy [arXiv:0803.0032]. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 265-273. ACM Press, 2008.
- Only You, Your Doctor, and Many Others May Know | Technology Science, by Latanya Sweeney