• Medical record numbers.
• Health plan beneficiary numbers.
• Account numbers.
• Certificate/license numbers.
• Vehicle identifiers and serial numbers, including license plate numbers.
• Device identifiers and serial numbers.
• Web universal resource locators (URLs).
• Internet protocol (IP) address numbers.
• Biometric identifiers, including fingerprints and voiceprints.
• Full-face photographic images and any comparable images.
• Any other unique identifying number, characteristic, or code, unless otherwise permitted by the Privacy Rule for re-identification.
2.4 DISCLOSURE RISK IN MICRODATA SETS
When publishing a microdata file, the data collector must guarantee that no sensitive information about specific individuals is disclosed. Usually two types of disclosure are considered in microdata sets [44].
• Identity disclosure. This type of disclosure violates privacy viewed as anonymity. It occurs when the intruder is able to associate a record in the released data set with the individual that originated it. After re-identification, the intruder can associate the values of the confidential attributes in the record with the re-identified individual. Two main approaches are usually employed to measure identity disclosure risk: uniqueness and record linkage.
– Uniqueness. Roughly speaking, the risk of identity disclosure is measured as the probability that rare combinations of attribute values in the released protected data are indeed rare in the original population the data come from.
– Record linkage. This is an empirical approach to evaluating the risk of disclosure. The data protector (also known as data controller) uses a record linkage algorithm (or several such algorithms) to link each record in the anonymized data set with a record in the original data set. Since the protector knows the true correspondence between original and anonymized records, he can compute the percentage of correctly linked pairs, which he uses to estimate the number of re-identifications a specialized intruder might achieve. If this number is unacceptably high, the controller needs to apply stronger anonymization before the anonymized data set is ready for release. A minimal sketch of this approach is given after this list.
• Attribute disclosure. This type of disclosure violates privacy viewed as confidentiality. It occurs when access to the released data allows the intruder to determine the value of a confidential attribute of an individual with enough accuracy.
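To make the record linkage approach concrete, the following minimal Python sketch (with invented numerical quasi-identifiers and a simple noise-based masking, both assumptions for illustration only) links each anonymized record to its nearest original record and reports the proportion of correct links:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original microdata: three numerical quasi-identifiers (e.g., age, height, weight).
X = rng.normal(loc=[40.0, 170.0, 70.0], scale=[12.0, 10.0, 15.0], size=(1000, 3))

# Toy anonymization: additive noise (one of the perturbative methods of Section 2.5).
Y = X + rng.normal(scale=5.0, size=X.shape)

# Distance-based record linkage: link every anonymized record to its nearest original record.
# The protector knows that record i in Y originates from record i in X, so the fraction of
# links that hit the right record estimates the re-identification risk.
dists = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)  # pairwise Euclidean distances
linked_to = dists.argmin(axis=1)                               # nearest original record per Y record
risk = np.mean(linked_to == np.arange(len(Y)))

print(f"Estimated proportion of correct re-identifications: {risk:.2%}")
```

If the estimated proportion is judged too high, the protector would apply stronger masking (for example, more noise or coarser recoding) and repeat the experiment before releasing the data.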
The above two types of disclosure are independent. Even if identity disclosure happens, there may be no attribute disclosure if the confidential attributes in the released data set have been masked. Conversely, attribute disclosure may happen even without identity disclosure. For example, imagine that salary is a confidential attribute and job is a quasi-identifier attribute. If an intruder is interested in a specific individual whose job he knows to be “accountant” and there are several accountants in the data set (including the target individual), the intruder will be unable to re-identify the individual’s record from the job alone, but he will be able to lower-bound and upper-bound the individual’s salary, which must lie between the minimum and the maximum salary of all the accountants in the data set. Attribute disclosure happens if this range of possible salary values is narrow.
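The accountant example can be written out in a few lines; the released records below are invented purely for illustration:

```python
# Hypothetical released records: (job, salary). The intruder knows the target is an accountant.
released = [
    ("accountant", 41000), ("accountant", 43500), ("accountant", 45000),
    ("engineer",   52000), ("nurse",      38000),
]

# Without re-identifying the target's record, the intruder can still bound the target's salary.
salaries = [salary for job, salary in released if job == "accountant"]
print(f"Target's salary lies between {min(salaries)} and {max(salaries)}")
```

The narrower this interval, the more severe the attribute disclosure.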
2.5 MICRODATA ANONYMIZATION
To avoid disclosure, data collectors do not publish the original microdata set X, but a modified version Y of it. This data set Y is called the protected, anonymized, or sanitized version of X. Microdata protection methods can generate the protected data set by either masking the original data or generating synthetic data.
• Masking. The protected data Y are generated by modifying the original records in X. Masking induces a relation between the records in Y and the original records in X. When masking is applied to quasi-identifier attributes, the identity behind each record is hidden (which yields anonymity). When it is applied to confidential attributes, the values of the confidential data are masked (which yields confidentiality, even if the subject to whom the record corresponds might still be re-identifiable). Masking methods can in turn be divided into two categories depending on their effect on the original data.
– Perturbative masking. The microdata set is distorted before publication. The perturbation method used should be such that the statistics computed on the perturbed data set do not differ significantly from the statistics that would be obtained on the original data set. Noise addition, microaggregation, data/rank swapping, microdata rounding, resampling, and PRAM are examples of perturbative masking methods; a sketch of the first two is given right after this list.
– Non-perturbative masking. Non-perturbative methods do not distort the data; rather, they partially suppress the data or reduce their detail (coarsen them). Sampling, global recoding, top and bottom coding, and local suppression are examples of non-perturbative masking methods.
• Synthetic data. The protected data set Y consists of randomly simulated records that do not directly derive from the records in X; the only connection between X and Y is that the latter preserves some statistics of the former (typically a model relating the attributes in X). The generation of a synthetic data set takes three steps [27, 77]: (i) a model for the population is proposed, (ii) the model is fitted to the original data set X, and (iii) the synthetic data set Y is generated by drawing from the fitted model (see the sketch at the end of this section). There are three types of synthetic data sets:
– Fully synthetic [77], where every attribute value of every record has been synthesized. The population units (subjects) contained in Y are not the original population units in X but a new sample from the underlying population.
– Partially synthetic [74], where only the data items (attribute values) at high risk of disclosure are synthesized. The population units in Y are the same population units as in X (in particular, X and Y have the same number of records).
– Hybrid [19, 65], where the original data set is mixed with a fully synthetic data set.
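The following minimal sketch illustrates two of the perturbative methods named above, noise addition and fixed-size microaggregation, on a single hypothetical numerical attribute; the data and parameter choices are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50000, scale=12000, size=99)  # hypothetical salary attribute (99 values)

# Noise addition: add zero-mean noise whose standard deviation is a fraction of the attribute's.
noise_added = x + rng.normal(scale=0.1 * x.std(), size=x.size)

# Fixed-size univariate microaggregation with group size k: sort the values, form groups of k
# consecutive values, and replace each value by its group average. (This toy version assumes
# len(values) is a multiple of k; practical algorithms such as MDAV handle the remainder so
# that every group contains at least k records.)
def microaggregate(values, k=3):
    order = np.argsort(values)
    out = np.empty_like(values, dtype=float)
    for start in range(0, len(values), k):
        group = order[start:start + k]
        out[group] = values[group].mean()
    return out

micro = microaggregate(x, k=3)

# Quick check that overall statistics are roughly preserved, as required of perturbative masking.
print(f"mean: original={x.mean():.0f}, noise-added={noise_added.mean():.0f}, "
      f"microaggregated={micro.mean():.0f}")
```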
In a fully synthetic data set, any dependency between X and Y must come from the model; in other words, X and Y are conditionally independent given the fitted model. The disclosure risk in fully synthetic data sets is usually low, as we justify next. On the one hand, the population units in Y are not the original population units in X. On the other hand, the information about the original data X conveyed by Y is limited to what the model captures, which is usually a small set of statistical properties. In a partially synthetic data set, the disclosure risk is reduced by replacing the values in the original data set at higher risk of disclosure with simulated values. The simulated values assigned to an individual should be representative but are not directly related to her. Hybrid data sets offer the lowest level of protection of the three: mixing original and synthetic records breaks the conditional independence between the original data and the synthetic data, and the parameters of the mixture determine the amount of dependence.
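The three-step generation procedure described above can be sketched as follows for a fully synthetic data set, using a multivariate normal distribution as the (assumed) population model; the attributes and parameter values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical original microdata X with three numerical attributes (e.g., age, salary, weight).
true_mean = [40.0, 50000.0, 70.0]
true_cov = [[100.0, 2000.0, 5.0],
            [2000.0, 1.0e8, 50.0],
            [5.0, 50.0, 225.0]]
X = rng.multivariate_normal(mean=true_mean, cov=true_cov, size=500)

# (i) Propose a model for the population: here, a multivariate normal distribution.
# (ii) Fit the model to the original data set X.
mu_hat = X.mean(axis=0)
sigma_hat = np.cov(X, rowvar=False)

# (iii) Draw the fully synthetic data set Y from the fitted model. Y approximately preserves the
# means and covariances of X, but its records do not correspond to the original subjects.
Y = rng.multivariate_normal(mean=mu_hat, cov=sigma_hat, size=500)

print("Original means: ", np.round(X.mean(axis=0), 1))
print("Synthetic means:", np.round(Y.mean(axis=0), 1))
```

A partially synthetic data set would instead keep most values of X and replace only the high-risk ones with draws from the fitted model.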
2.6 MEASURING INFORMATION LOSS
The evaluation of the utility of the protected data set must be based on the intended uses of the data. The closer the results obtained for these uses on the protected data are to the results that would be obtained on the original data, the lower the information loss caused by the anonymization.
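If, for example, the intended uses are the computation of means and pairwise correlations, information loss can be gauged by comparing those statistics on X and on Y; the following minimal sketch does so for an invented original data set and a noise-masked version of it (both assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical original data X (two numerical attributes) and a noise-masked version Y.
X = rng.normal(loc=[40.0, 50000.0], scale=[12.0, 15000.0], size=(1000, 2))
Y = X + rng.normal(scale=0.1 * X.std(axis=0), size=X.shape)

# Information loss proxy: discrepancy between statistics computed on X and on Y.
mean_loss = np.abs(X.mean(axis=0) - Y.mean(axis=0))
corr_loss = np.abs(np.corrcoef(X, rowvar=False) - np.corrcoef(Y, rowvar=False)).max()

print("Absolute difference of attribute means:", mean_loss)
print("Maximum absolute difference of correlations:", corr_loss)
```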