Database Anonymization. David Sánchez

… the attributes collected. We use X to denote the collected microdata file. We assume that X contains information about n respondents and m attributes. We use x_i to refer to the record contributed by respondent i, and x^j (or X^j) to refer to attribute j. The value of attribute j for respondent i is denoted by x_i^j.
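
      To make this notation concrete, the following minimal Python sketch represents a toy microdata file X with three records and four attributes; the attribute names and values are invented for illustration and do not come from any real data set.

# Toy microdata file X with n = 3 respondents and m = 4 attributes.
# All attribute names and values are invented for illustration.
ATTRIBUTES = ["ZIP", "BirthYear", "Sex", "Salary"]   # X^1, ..., X^4

X = [
    ["08001", 1980, "F", 42000],   # x_1: record contributed by respondent 1
    ["08002", 1975, "M", 39000],   # x_2
    ["08001", 1980, "F", 57000],   # x_3
]

def value(X, i, j):
    """Return x_i^j, the value of attribute j for respondent i (1-indexed)."""
    return X[i - 1][j - 1]

def attribute(X, j):
    """Return X^j, the values of attribute j for all respondents."""
    return [record[j - 1] for record in X]

print(value(X, 1, 4))    # 42000, i.e., x_1^4
print(attribute(X, 3))   # ['F', 'M', 'F'], i.e., X^3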

      The attributes in a microdata set are usually classified into the following non-exclusive categories.

      • Identifiers. An attribute is an identifier if it provides unambiguous re-identification of the individual to which the record refers. Some examples of identifier attributes are the social security number, the passport number, etc. If a record contains an identifier, any sensitive information contained in other attributes may immediately be linked to a specific individual. To avoid direct re-identification of an individual, identifier attributes must be removed or encrypted. In the following chapters, we assume that identifier attributes have previously been removed.

      • Quasi-identifiers. Unlike an identifier, a quasi-identifier attribute alone does not lead to record re-identification. However, in combination with other quasi-identifier attributes, it may allow unambiguous re-identification of some individuals. For example, [99] shows that 87% of the population in the U.S. can be unambiguously identified by combining a 5-digit ZIP code, birth date, and sex (a small linkage sketch is given after this list). Removing quasi-identifier attributes, as proposed for identifiers, is not possible, because quasi-identifiers are usually needed to perform any useful analysis of the data. Deciding whether a specific attribute should be considered a quasi-identifier is a thorny issue. In practice, any information an intruder has about an individual can be used for record re-identification. For uninformed intruders, only the attributes available in an external non-anonymized data set should be classified as quasi-identifiers; in the presence of informed intruders, any attribute may potentially be a quasi-identifier. Thus, in the strictest case, to make sure all potential quasi-identifiers have been removed, one ought to remove all attributes (!).

      • Confidential attributes. Confidential attributes hold sensitive information about the individuals who took part in the data collection process (e.g., salary, health condition, or sexual orientation). The primary goal of microdata protection techniques is to prevent intruders from learning confidential information about a specific individual. This goal involves not only preventing the intruder from determining the exact value that a confidential attribute takes for some individual, but also preventing accurate inferences on the value of that attribute (such as bounding it).

      • Non-confidential attributes. Non-confidential attributes are those that do not belong to any of the previous categories. As they do not contain sensitive information about individuals and cannot be used for record re-identification, they do not affect our discussion on disclosure limitation for microdata sets. Therefore, we assume that none of the attributes in X belong to this category.
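
      As a rough illustration of the linkage attack enabled by quasi-identifiers, the following Python sketch joins an anonymized release with an identified public data set on ZIP code, birth year, and sex. All records, names, and attribute choices are invented for the example.

# Hypothetical linkage attack on quasi-identifiers (all records invented).
# "released" is an anonymized file with identifiers removed; "public" is an
# identified external source (e.g., a voter registry) sharing the same
# quasi-identifiers: ZIP code, birth year, and sex.

released = [
    {"ZIP": "08001", "BirthYear": 1980, "Sex": "F", "Diagnosis": "diabetes"},
    {"ZIP": "08002", "BirthYear": 1975, "Sex": "M", "Diagnosis": "asthma"},
]

public = [
    {"Name": "Alice", "ZIP": "08001", "BirthYear": 1980, "Sex": "F"},
    {"Name": "Bob",   "ZIP": "08002", "BirthYear": 1975, "Sex": "M"},
]

QUASI_IDENTIFIERS = ("ZIP", "BirthYear", "Sex")

def link(released, public, qis=QUASI_IDENTIFIERS):
    """Re-identify released records whose quasi-identifier combination is
    unique in the public data set."""
    matches = []
    for r in released:
        candidates = [p for p in public if all(p[q] == r[q] for q in qis)]
        if len(candidates) == 1:   # unambiguous re-identification
            matches.append((candidates[0]["Name"], r["Diagnosis"]))
    return matches

print(link(released, public))   # [('Alice', 'diabetes'), ('Bob', 'asthma')]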

      A first attempt to come up with a formal definition of privacy was made by Dalenius in [14]. He stated that access to the released data should not allow any attacker to increase his knowledge about confidential information related to a specific individual. In other words, the prior and the posterior beliefs about an individual in the database should be similar. Because the ultimate goal in privacy is to preserve the secrecy of sensitive information about specific individuals, this is a natural definition of privacy. However, Dalenius’ definition is too strict to be useful in practice. This was illustrated with two examples in [29]. The first one considers an adversary whose prior view is that everyone has two left feet. By accessing a statistical database, the adversary learns that almost everybody has one left foot and one right foot, thus modifying his posterior belief about individuals to a great extent. In the second example, the use of auxiliary information makes things worse. Suppose that a statistical database teaches the average height of a group of individuals, and that it is not possible to learn this information in any other way. Suppose also that the actual height of a person is considered to be a sensitive piece of information. Let the attacker have the following side information: “Adam is one centimeter taller than the average English man.” Access to the database teaches Adam’s height, while having the side information but no database access teaches much less. Thus, Dalenius’ view of privacy is not feasible in the presence of background information (if any utility is to be provided).
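
      The height example can be made explicit with a couple of lines of arithmetic. The numbers below are invented; the point is only that the released statistic, combined with the side information, pins down the sensitive value.

# Worked version of the height example with invented numbers.
# The side information alone ("Adam is one centimeter taller than the average
# English man") reveals nothing concrete; combined with the released average,
# it discloses Adam's height exactly.
average_height_cm = 175.4        # learned only from the statistical database
side_information_cm = 1.0        # attacker's background knowledge about Adam

adam_height_cm = average_height_cm + side_information_cm
print(adam_height_cm)            # 176.4: a sensitive value disclosed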

      The privacy criteria used in practice offer only limited disclosure control guarantees. Two main views of privacy are used for microdata releases: anonymity (it should not be possible to re-identify any individual in the published data) and confidentiality or secrecy (access to the released data should not reveal confidential information related to any specific individual).

      The confidentiality view of privacy is closer to Dalenius’ proposal, the main difference being that it limits the amount of information provided by the data set rather than the change between the prior and posterior beliefs about an individual. There are several approaches to attain confidentiality. A basic example of an SDC technique that provides confidentiality is noise addition. By adding random noise to a confidential data item, we mask its value: we report a value drawn from a random distribution rather than the actual value. The amount of noise added determines the level of confidentiality.
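
      As a minimal sketch of noise addition (with invented values, and without any claim about which noise distribution or parameters should be used in practice), the following Python snippet masks a confidential salary with zero-mean Gaussian noise.

import random

def mask(value, noise_std):
    """Report a noisy version of a confidential value instead of the value
    itself; a larger noise_std gives more confidentiality but less utility."""
    return value + random.gauss(0.0, noise_std)

true_salary = 42000.0                 # confidential value (invented)
print(mask(true_salary, 1000.0))      # lightly masked
print(mask(true_salary, 10000.0))     # heavily masked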

      The anonymity view of privacy seeks to hide each individual in a group. This is indeed quite an intuitive view of privacy: the privacy of an individual is protected if we are not able to distinguish her from other individuals in a group. This view of privacy is commonly used in legal frameworks. For instance, the U.S. Health Insurance Portability and Accountability Act (HIPAA) of 1996 requires removing several attributes that could potentially identify an individual; in this way, the individual stays anonymous. However, we should keep in mind that if the value of the confidential attribute has little variability within the group of indistinguishable individuals, disclosure still happens for these individuals: even if we cannot tell which record belongs to each of the individuals, the low variability of the confidential attribute gives us a good estimate of its actual value.
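
      The following small Python sketch, with invented values, illustrates this residual disclosure: the attacker cannot single out the target’s record within the group, yet can still estimate the confidential value accurately because the group is homogeneous.

# Hypothetical group of four records made indistinguishable on their
# quasi-identifiers (e.g., same generalized ZIP code, age range, and sex).
# Even without knowing which record is the target's, the confidential
# attribute varies so little that its value is effectively disclosed.
group_salaries = [41800, 42000, 42100, 42300]   # invented values

low, high = min(group_salaries), max(group_salaries)
estimate = sum(group_salaries) / len(group_salaries)
print(f"target's salary lies in [{low}, {high}], roughly {estimate:.0f}")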

       The Health Insurance Portability and Accountability Act (HIPAA)

      The Privacy Rule allows a covered entity to de-identify data by removing all 18 elements that could be used to identify the individual or the individual’s relatives, employers, or household members; these elements are enumerated in the Privacy Rule. The covered entity also must have no actual knowledge that the remaining information could be used alone or in combination with other information to identify the individual who is the subject of the information. Under this method, the identifiers that must be removed are the following:

      • Names.

      • All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geographical codes, except for the initial three digits of a ZIP code if, according to the current publicly available data from the Bureau of the Census:

      – The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and

      – The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people are changed to 000 (a small sketch of this rule appears at the end of this section).

      • All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.

      • Telephone numbers.

      • Facsimile numbers.

      • Electronic mail addresses.

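      The ZIP code generalization rule in the list above can be sketched in a few lines of Python. The set of restricted three-digit prefixes used here is a placeholder; in practice it would have to be derived from current Census population data.

# Sketch of the Safe Harbor rule for ZIP codes described above. The set of
# restricted three-digit prefixes must be derived from current Census data
# (prefixes whose combined population is 20,000 or fewer); the values below
# are placeholders, not the official list.
RESTRICTED_PREFIXES = {"036", "059", "102"}   # illustrative placeholders

def generalize_zip(zip_code):
    """Keep only the first three digits of a ZIP code, or return '000' if the
    prefix corresponds to an area of 20,000 or fewer people."""
    prefix = zip_code[:3]
    return "000" if prefix in RESTRICTED_PREFIXES else prefix

print(generalize_zip("08034"))   # '080'
print(generalize_zip("03601"))   # '000'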