There is, unfortunately, no easy way to turn the data from a production database into a safe, anonymized version from which all sensitive data has been removed and which is secure from re-identification. It is tricky to do properly.
Although data masking tools are necessary, they are not sufficient. In Europe, they can be used to implement a masking strategy only when the appropriate assessments have taken place and been properly documented in a defensible GDPR compliance report.
In the States, the industry standard is the equivalent HIPAA Expert Determination. Why? Surely, common sense tells you that if you mask out identifiers such as names and phone numbers, then the data is of no use to anyone wishing to identify the people in it.
In fact, the reverse is true: data can be re-identified alarmingly easily if just a few clues are left in the data. It takes an expert eye to spot these clues.
Why mask data?
Data masking, obfuscation, ‘pseudonymisation’, or ‘de-identification’ of data is required when certain data within a dataset must be kept private. This can be for many reasons, though privacy is the most common concern.
There are two main purposes for masking data: one is to support certain testing, development and training work on a database system, and the other is the publication of open data.
Restricted data reports
It is important for scientific research, particularly medical, genetic and epidemiological research, that data can be shared.
Open government also requires that all manner of data be available for inspection. Before such data is shared, the part of the data that would uniquely identify an individual or group of individuals is masked out, obfuscated or removed, on the assumption that the data will then no longer reveal personal information.
To further this aim, the preparation of healthcare open data sets in the USA is subject to the HIPAA Safe Harbor provision, which restricts disclosure of, for example, an individual’s full date of birth (only the year of birth may be reported) and sets the smallest geographic unit that may be reported. It also lists 18 identifiers that must be removed from such shared data. The GDPR is less prescriptive, preferring an approach similar to HIPAA Expert Determination.
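As a minimal sketch, the date-of-birth and geographic generalisations might look something like the following. The record layout and field names are purely illustrative, and the function covers only two of the 18 Safe Harbor identifiers.

```python
from datetime import date

def safe_harbor_generalise(record: dict) -> dict:
    """Apply two of the HIPAA Safe Harbor generalisations to a single record.
    The field names here are illustrative, not part of any standard schema."""
    out = dict(record)

    # Dates of birth may only be reported as the year of birth.
    dob: date = out.pop("date_of_birth")
    out["year_of_birth"] = dob.year

    # Geographic detail is restricted: keep only the first three digits of the
    # ZIP code (Safe Harbor further requires '000' for sparsely populated
    # three-digit areas, which is not modelled in this sketch).
    out["zip3"] = out.pop("zip_code")[:3]
    return out

print(safe_harbor_generalise({
    "date_of_birth": date(1957, 6, 14),
    "zip_code": "02139",
    "diagnosis": "influenza",
}))
# -> {'diagnosis': 'influenza', 'year_of_birth': 1957, 'zip3': '021'}
```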
Test datasets
Data that is required for development and testing is a special case, because the entire database, rather than an extract or subset, needs to be present, albeit with masking or obfuscation in place.
Such a copy is generally used within the database industry to track down a bug that is present only with the production data and cannot be replicated with generated data.
Sensitive data, and data that can be used to identify individuals, must be masked or obfuscated. A test database carries an extra danger beyond that of ‘restricted data’, because it contains the whole dataset rather than a prepared extract.
What can go wrong?
Even data that has been properly de-identified under the HIPAA Privacy Rule may still carry some private information and therefore poses some risk of re-identification, a risk that grows as new datasets are released and as datasets are combined.
For several years, privacy researchers have published work demonstrating that individuals can be identified from masked data. The data industry has found this fact difficult to absorb, so security experts have had to use more spectacular means to make their point.
- In 2002, a researcher combined two databases, the voter registration list for Cambridge, Massachusetts, and the Massachusetts Group Insurance Commission (GIC) medical encounter data, and was able to identify the medical records of the then governor of Massachusetts, William Weld.
- In 2014, individual journeys, and the identities of the people who made them, were reverse-engineered from the anonymized public dataset of hired ‘Boris Bikes’ in London.
- In 2015, researchers examining three months of anonymized credit card records for 1.1 million people, with no obvious identifiers beyond four spatiotemporal points per person, found that they could uniquely re-identify 90% of individuals and uncover all of their records.
- In 2017, journalists re-identified politicians in an anonymized browsing-history dataset of 3 million German citizens, uncovering their medical information and sexual preferences, including the porn preferences of a judge and the medicines used by a German MP.
- In 2016, when the Australian Department of Health publicly released de-identified medical records for 10% of the population, privacy experts managed to re-identify individuals within six weeks.
- In 2016, the publicly released New York City (NYC) taxi trips dataset was found to allow the income of individually identifiable taxi drivers to be calculated. Even without decoding the medallion numbers, 91% of the taxis operating in NYC could be identified from the data.
- In 2016, open data containing public-transport ride registrations in Riga was shown to yield personal data about the journeys people were taking.
These breaches were carried out by privacy researchers to prove the point. However, similar techniques are regularly used by journalists and private investigators.
For example, when a man was forcibly ejected from a United Airlines flight in 2017, he was filmed yelling that he was a doctor and that he was being profiled for being Chinese. This was enough: within hours, news reporters using only publicly available information were at his house, requesting an interview.
In the UK, you have an 80% chance of being correctly identified solely from your postcode, date of birth and gender; this rises to 99% with seven attributes. There is a growing number of viable techniques for doing this.
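The underlying problem is easy to demonstrate with a simple k-anonymity check: count how many rows share each combination of quasi-identifier values, and watch how quickly the groups shrink to one as attributes are added. This is a minimal sketch over made-up data.

```python
from collections import Counter

# Toy dataset: no names or phone numbers, only 'harmless' quasi-identifiers.
rows = [
    {"postcode": "CB1 2AB",  "dob": "1984-03-07", "gender": "F"},
    {"postcode": "CB1 2AB",  "dob": "1984-03-07", "gender": "M"},
    {"postcode": "SW1A 1AA", "dob": "1990-11-23", "gender": "F"},
    {"postcode": "SW1A 1AA", "dob": "1990-11-23", "gender": "F"},
]

def k_anonymity(rows, quasi_identifiers):
    """Smallest group size over all combinations of the quasi-identifier values.
    k == 1 means at least one person is unique, and so trivially re-identifiable."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

print(k_anonymity(rows, ["postcode"]))                   # 2
print(k_anonymity(rows, ["postcode", "dob", "gender"]))  # 1: a unique individual
```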
The organisations that provide open data argue that, where such data represents only a sample of the full data or is incomplete, de-anonymization is unlikely to expose much personal information with sufficient certainty to be confident of a match.
However, de-anonymization approaches are now so sophisticated that they can identify individuals and data with a high level of probability.
Data masking, by itself, is no longer a guarantee that personal information cannot be identified. Using a generative copula-based method of re-identification, researchers have estimated that 99.98% of Americans would be correctly re-identified in any dataset that contains 15 demographic attributes.
The GDPR is wary of the effectiveness of pseudonymization and so requires that pseudonymized data be treated in the same way as the original data, in terms of security and encryption, unless it has been subject to a risk assessment.
Data and strategies for compliance
It is important to find a compromise between the concerns of privacy and the benefits of exchanging and publishing data.
While there are very few, if any, cases of individuals who have been harmed by attacks with verified re-identifications, all of humanity has benefited from the use of de-identified health information.
De-identified health data is the workhorse that routinely supports numerous healthcare improvements and a wide variety of medical research activities: many of us owe our lives to the ongoing research and health system improvements that have been realized because of the analysis of de-identified data.
Likewise, the publication of open data by government and academic institutions has helped society greatly and enabled a range of expertise to be brought to bear on the way society is regulated.
To support this, many organisations, whether commercial, governmental, trusts or service providers, are required to be good custodians of the private data they hold on individuals. They are also often required to export that data outside the rigorous, access-controlled security of the database.
To meet these conflicting requirements, several techniques are used. The objective is to ensure that the data remains useful while the risk of re-identification stays low.
Pseudonymised data has its direct identifiers removed, but there is still a danger that it can be restored to an identifiable form and then used inappropriately.
Therefore, pseudonymised data is still classified as personal data; it cannot be considered anonymous and so is subject to GDPR and HIPAA controls and restrictions.
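As a minimal sketch of why this is so, consider pseudonymisation by keyed hashing (the key, token length and field below are arbitrary choices for illustration): the direct identifier is replaced by a stable token, but anyone who holds the key, or who can link the stable tokens against another dataset, can get back to the individual.

```python
import hashlib
import hmac

# The pseudonymisation key must be protected as carefully as the data itself:
# anyone holding it can link every token straight back to the original identifier.
SECRET_KEY = b"keep-this-out-of-the-masked-database"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a stable keyed token (HMAC-SHA256).
    The same input always yields the same token, so records stay linkable,
    which is precisely why the result is still personal data."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

print(pseudonymise("alice@example.com"))   # stable token for this identifier
print(pseudonymise("alice@example.com") == pseudonymise("alice@example.com"))  # True
```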
Some of the more commonly used techniques are described below.
Restricting distribution and access
Although many public datasets can be used by anyone, many scientific datasets are highly restricted and use ‘document-retention’ rules to delete the data after use.
Others are subject to Data Use Agreements (DUAs) or to direct Health and Human Services (HHS) mandates on the conditions of use.
Encryption is the most widely used technology for implementing file-level access control and is recommended by GDPR Article 32.
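As a minimal sketch, and assuming the third-party cryptography package and an illustrative filename, encrypting an extract before it leaves the database server might look like this; in practice the key would come from a key-management system rather than being generated next to the data it protects.

```python
# Requires the third-party 'cryptography' package: pip install cryptography
from cryptography.fernet import Fernet

# In production the key would come from a key-management service, not be
# generated inline alongside the data.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt an extract before it leaves the access-controlled database server.
with open("patient_extract.csv", "rb") as f:        # illustrative filename
    ciphertext = fernet.encrypt(f.read())

with open("patient_extract.csv.enc", "wb") as f:
    f.write(ciphertext)

# Only holders of the key can recover the plaintext.
plaintext = fernet.decrypt(ciphertext)
```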
Aggregation
Data is generally aggregated at the level required by the recipients of the data.
If, for example, epidemiologists need the incidence of influenza broken down by geographical area, the data can be aggregated to exactly that level, thereby avoiding the release of individual records.
No data other than that meeting the precise criteria should be included. If the researchers are clear about the hypothesis being tested, this should not worry them. If a ‘freedom of information’ request requires certain data, then just that data should be provided.
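A minimal sketch of this kind of release, using made-up row-level records and illustrative field names: only the requested aggregate leaves the database, never the individual encounters.

```python
from collections import Counter

# Row-level records (never released); the field names are illustrative.
encounters = [
    {"region": "East Midlands", "diagnosis": "influenza"},
    {"region": "East Midlands", "diagnosis": "influenza"},
    {"region": "London",        "diagnosis": "influenza"},
    {"region": "London",        "diagnosis": "asthma"},
]

# Release only the aggregate the researchers asked for:
# influenza counts per region, nothing else.
flu_by_region = Counter(e["region"] for e in encounters
                        if e["diagnosis"] == "influenza")

print(dict(flu_by_region))   # {'East Midlands': 2, 'London': 1}
```

A common additional safeguard, not shown here, is to suppress or round any cell whose count is small enough to point to an individual.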
Sampling
In many cases, only a subset of the data is provided, thereby reducing the possibility of an exact identification.
The problem, of course, is the difficulty of creating a sample that is representative of the population as a whole.
Even heavily sampled, masked and anonymized datasets are unlikely to comply with the requirements of the GDPR, which considers that each and every person in a dataset must be protected before the dataset can be considered anonymous.
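A minimal sketch of a simple random sample over a made-up population, with a quick check that the sample still roughly reflects the distribution of one attribute; none of this, of course, addresses the GDPR point above.

```python
import random
from collections import Counter

random.seed(1)   # fixed seed for a repeatable example only

# Toy population: each row carries one attribute whose distribution matters.
population = [
    {"id": i, "region": random.choice(["North", "South", "East", "West"])}
    for i in range(100_000)
]

# Release a 10% simple random sample instead of the full dataset.
sample = random.sample(population, k=len(population) // 10)

def shares(rows):
    """Proportion of rows in each region, rounded for display."""
    counts = Counter(r["region"] for r in rows)
    return {region: round(n / len(rows), 3) for region, n in sorted(counts.items())}

# A rough check that the sample is representative of the population it came from.
print("population:", shares(population))
print("sample:    ", shares(sample))
```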
Masking
Masking is relevant only when a database needs to be delivered in a ‘de-identified’ form to enable the technology team within the organisation to perform certain testing, training and maintenance tasks.
Instead of leaving the sensitive information out entirely, it is obfuscated or replaced with entirely generated data. This is more difficult than it may seem, because of the way keys are used in a relational database.
If, for example, a person’s name is used as part of a primary key, then the change has to be propagated to all the referencing foreign keys.
When a heavily indexed column is obfuscated, any change in the statistical distribution of the key will change the way that the database performs.
Indexes often have distribution statistics attached to them, and these statistics are used to determine the best execution plan for queries that use the index. If the statistics change, then the whole purpose of testing with the full dataset is lost.
Another problem with masking an identifiable characteristic is ensuring that it is masked in every place where it exists. It is pointless, for example, to mask a person’s name in the main record but forget that it also appears in an XML field, or sits denormalised within associated records.
Sometimes, it is startling to examine a database in its raw file format and see data that you thought was masked staring out from the screen in forgotten ‘note’ fields or email addresses. With a database that isn’t normalised, it can be difficult to get this simple task right.
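One widely used way of keeping every copy of a value consistent is deterministic substitution: derive the replacement from the original so that the same input always maps to the same fake value, wherever it appears. The sketch below is a simplified illustration; the name lists and hashing scheme are arbitrary, and a deterministic mapping like this is itself a form of pseudonymisation that belongs in the risk assessment.

```python
import hashlib

# Arbitrary pools of replacement values for the sketch.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey", "Morgan", "Riley"]
LAST_NAMES  = ["Smith", "Patel", "Garcia", "Nguyen", "Brown", "Kowalski"]

def mask_name(real_name: str) -> str:
    """Deterministically replace a real name with a plausible fake one.
    The same input always maps to the same output, so every table, XML
    fragment or denormalised copy that holds the name stays consistent,
    and joins on the masked value still work in the test copy."""
    digest = int(hashlib.sha256(real_name.encode("utf-8")).hexdigest(), 16)
    first = FIRST_NAMES[digest % len(FIRST_NAMES)]
    last = LAST_NAMES[(digest // 1000) % len(LAST_NAMES)]
    return f"{first} {last}"

print(mask_name("William Weld"))   # same replacement everywhere it is applied
print(mask_name("William Weld") == mask_name("William Weld"))  # True
```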
Shuffling and generation
If a column must be present for the database to be used, but its contents need not be accurate, then the data within that column can be shuffled between rows so that the overall distribution remains the same but the values can no longer be used to identify the individual entities in those rows. Alternatively, a column can be entirely generated according to specific rules.
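A minimal sketch of column shuffling over a made-up table: the set of salary values, and therefore the distribution that index statistics depend on, is unchanged, but no value remains attached to its real row.

```python
import random

random.seed(7)   # fixed seed for a repeatable example only

# Toy table: the salary column is sensitive, but its distribution matters
# for realistic index statistics and query plans.
rows = [
    {"employee_id": 1, "salary": 32_000},
    {"employee_id": 2, "salary": 48_500},
    {"employee_id": 3, "salary": 91_250},
    {"employee_id": 4, "salary": 48_500},
]

# Shuffle the sensitive column across rows: same values, same distribution,
# but no salary is attached to its real employee any more.
salaries = [r["salary"] for r in rows]
random.shuffle(salaries)
for row, masked_salary in zip(rows, salaries):
    row["salary"] = masked_salary

print(rows)
```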
Conclusions
Data that has had the obvious means of identification removed can be ‘re-identified’ by various techniques, ranging from the obvious to the complex. No technique can, by itself, guarantee compliance with what society requires for privacy; they have to be applied with care and intelligence, and with a broad knowledge of the re-identification techniques in circulation.
To be able to report to your organisation that the data in its care is secure, even in its masked, pseudonymized or de-identified form, the techniques you use for masking need to be underpinned by a written risk assessment. This assessment must be agreed by the member of the organisation responsible for the role of data protection officer (DPO), or equivalent, and recorded in a defensible compliance report.
- Methods for De-identification of PHI | HHS.gov
- The ‘Re-Identification’ of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now, by Daniel Barth-Jones | SSRN
- I Know Where You Were Last Summer: London’s public bike data is telling everyone where you’ve been
- Unique in the shopping mall: On the reidentifiability of credit card metadata | Science
- ‘Anonymous’ browsing data can be easily exposed, researchers reveal | Technology | The Guardian
- Estimating the success of re-identifications in incomplete datasets using generative models | Nature Communications
- Anonymizing NYC Taxi Data: Does It Matter? | IEEE Conference Publication
- Privacy violations in Riga open data public transport system | IEEE Conference Publication
- Pseudonymization and the Inference Attack | Simple Talk
- The Case for De-Identifying Personal Health Information | Khaled El Emam
- Personal Data, Privacy, and the GDPR | Simple Talk
- Privacy in Databases, by Mathy Vanhoef (2012) [PDF]
- Computational disclosure control: a primer on data privacy protection. PhD thesis, Massachusetts Institute of Technology, 2001.
- Robust De-anonymization of Large Sparse Datasets | IEEE Conference Publication. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, pages 111-125. IEEE Computer Society, 2008.
- Composition Attacks and Auxiliary Information in Data Privacy [arXiv:0803.0032]. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 265-273. ACM Press, 2008.
- Only You, Your Doctor, and Many Others May Know | Technology Science, by Latanya Sweeney