
… preserved. However, very often, microdata protection cannot be performed in a data-use-specific manner, for the following reasons.

      • Potential data uses are very diverse and it may even be hard to identify them all at the moment of the data release.

      • Even if all the data uses could be identified, releasing several versions of the same original data set, where the i-th version is optimized for the i-th data use, may result in unexpected disclosure.

      Since data must often be protected with no specific use in mind, it is usually more appropriate to refer to information loss rather than to utility. Measures of information loss provide generic ways for the data protector to assess how much harm is being inflicted on the data by a particular data masking technique.

      Information loss measures for numerical data. Assume a microdata set X with n individuals (records) x1,…,xn and m continuous attributes X1,…,Xm. Let Y be the protected microdata set. The following tools are useful to characterize the information contained in the data set (a computational sketch is given after this list):

      • Covariance matrices V (on X) and V′ (on Y).

      • Correlation matrices R and R′.

      • Correlation matrices RF and RF′ between the m attributes and the p factors PC1, PC2,…,PCp obtained through principal components analysis.

      • Communality between each of the m attributes and the first principal component PC1 (or other principal components PCi). Communality is the percentage of the variance of each attribute that is explained by PC1 (or PCi). Let C be the vector of communalities for X, and C′ the corresponding vector for Y.

      • Factor score coefficient matrices F and F′. Matrix F contains the factors that should multiply each attribute in X to obtain its projection on each principal component. F′ is the corresponding matrix for Y.
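      As an illustration only, the tools above can be computed with standard linear-algebra routines. The sketch below is not from [25, 87]; the helper name characterize, the use of NumPy, and the choice of extracting principal components from the correlation matrix (standardized PCA) are assumptions of this sketch.

```python
import numpy as np

def characterize(X):
    """Return (V, R, RF, C, F) for a numeric data matrix X (rows = records)."""
    V = np.cov(X, rowvar=False)              # covariance matrix
    R = np.corrcoef(X, rowvar=False)         # correlation matrix
    # Principal components from the correlation matrix (standardized PCA).
    eigval, eigvec = np.linalg.eigh(R)
    order = np.argsort(eigval)[::-1]         # components by decreasing variance
    eigval = np.clip(eigval[order], 1e-12, None)
    eigvec = eigvec[:, order]
    RF = eigvec * np.sqrt(eigval)            # loadings: corr(attribute, component)
    C = RF[:, 0] ** 2                        # communalities w.r.t. PC1
    F = eigvec / np.sqrt(eigval)             # factor score coefficient matrix
    return V, R, RF, C, F
```

      Applying characterize to both X and Y yields the pairs (V, V′), (R, R′), (RF, RF′), (C, C′), and (F, F′) whose discrepancies are measured below.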

      There does not seem to be a single quantitative measure which completely reflects the structural differences between X and Y. Therefore, in [25, 87] it was proposed to measure the information loss through the discrepancies between the matrices X, V, R, RF, C, and F obtained on the original data and the corresponding Y, V′, R′, RF′, C′, and F′ obtained on the protected data set. In particular, discrepancy between correlations is related to the information loss for data uses such as regressions and cross-tabulations. Matrix discrepancy can be measured in at least three ways, as sketched after the list below.

      • Mean square error. Sum of squared componentwise differences between pairs of matrices, divided by the number of cells in either matrix.

      • Mean absolute error. Sum of absolute componentwise differences between pairs of matrices, divided by the number of cells in either matrix.

      • Mean variation. Sum of absolute percent variation of components in the matrix computed on the protected data with respect to components in the matrix computed on the original data, divided by the number of cells in either matrix. This approach has the advantage of not being affected by scale changes of attributes.
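      A minimal sketch of these three discrepancy measures for a pair of matrices (say V and V′), assuming NumPy arrays; the small eps guard against division by zero in the mean variation is an assumption of this sketch:

```python
import numpy as np

def mean_square_error(A, B):
    return np.mean((A - B) ** 2)             # squared componentwise differences

def mean_absolute_error(A, B):
    return np.mean(np.abs(A - B))            # absolute componentwise differences

def mean_variation(A, B, eps=1e-12):
    # Absolute variation of the protected-data matrix B with respect to the
    # original-data matrix A; scale-free, but undefined where A has zero
    # entries, hence the eps guard.
    return np.mean(np.abs(B - A) / np.maximum(np.abs(A), eps))
```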

      Information loss measures for categorical data. These have usually been based on direct comparison of categorical values, comparison of contingency tables, or Shannon’s entropy [25]. More recently, the importance of the semantics underlying categorical data for data utility has been recognized [60, 83]. As a result, semantically grounded information loss measures that exploit the formal semantics provided by structured knowledge sources (such as taxonomies or ontologies) have been proposed both to measure the practical utility and to guide the sanitization algorithms in terms of the preservation of data semantics [23, 57, 59].
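      As a toy illustration of what “semantically grounded” means, the sketch below measures the distance between two categorical values as the path length through a hand-built taxonomy; both the taxonomy and the distance are our simplifications, not the actual measures of [23, 57, 59].

```python
# child -> parent links of a tiny assumed taxonomy
parent = {
    "asthma": "respiratory disease",
    "pneumonia": "respiratory disease",
    "respiratory disease": "disease",
    "diabetes": "disease",
}

def ancestors(c):
    """Path from a category up to the taxonomy root."""
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path

def taxonomy_distance(a, b):
    """Number of taxonomy edges between a and b via their lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    common = next(x for x in pa if x in pb)  # lowest common ancestor
    return pa.index(common) + pb.index(common)

# Replacing "asthma" by "pneumonia" (distance 2) loses less semantics than
# replacing it by "diabetes" (distance 3).
```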

      Bounded information loss measures. The information loss measures discussed above are unbounded, i.e., they do not take values in a predefined interval. On the other hand, as discussed below, disclosure risk measures are naturally bounded: the risk of disclosure lies between 0 and 1. Defining bounded information loss measures may be convenient to enable the data protector to trade off information loss against disclosure risk. In [61], probabilistic information loss measures bounded between 0 and 1 are proposed for continuous data.
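      The sketch below shows one way an unbounded discrepancy can be mapped into [0, 1], in the spirit of, though not necessarily identical to, the measures of [61]: the discrepancy of a sample statistic is standardized and converted into the probability that a standard normal deviate falls within it.

```python
from math import erf, sqrt

def bounded_loss(theta_orig, theta_prot, stdev):
    # Standardized discrepancy of a sample statistic, mapped into [0, 1):
    # 2 * P(0 <= Z <= z) for standard normal Z, which equals erf(z / sqrt(2)).
    # stdev is the statistic's sampling standard deviation (assumed known
    # or estimated).
    z = abs(theta_prot - theta_orig) / stdev
    return erf(z / sqrt(2))
```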

      Propensity scores: a global information loss measure for all types of data. In [105], an information loss measure U applicable to both continuous and categorical microdata was proposed. It is computed as follows (a code sketch is given after the steps).

      1. Merge the original microdata set X and the anonymized microdata set Y, and add to the merged data set a binary attribute T with value 1 for the anonymized records and 0 for the original records.

      2. Regress T on the rest of the attributes of the merged data set and call the adjusted attribute T̂. For categorical attributes, logistic regression can be used.

      3. Let the propensity score p̂i of record i of the merged data set be the value of T̂ for record i. Then the utility of Y is high if the propensity scores of the anonymized and original records are similar (this means that, based on the regression model used, anonymized records cannot be distinguished from original records).

      4. Hence, if the number of original and anonymized records is the same, say N, a utility measure is

      U = (1/2N) Σ_{i=1}^{2N} (p̂i − 1/2)².

      The farther U is from 0, the higher the information loss, and conversely.
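      The following sketch illustrates steps 1–4 for numeric attributes, using scikit-learn’s logistic regression; the function name propensity_utility is ours, and categorical attributes would first need to be encoded numerically.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_utility(X, Y):
    """X, Y: original and anonymized data as N x m numeric arrays."""
    merged = np.vstack([X, Y])                    # step 1: merge the two sets
    T = np.concatenate([np.zeros(len(X)), np.ones(len(Y))])
    model = LogisticRegression().fit(merged, T)   # step 2: regress T on attributes
    p_hat = model.predict_proba(merged)[:, 1]     # step 3: propensity scores
    # step 4: U = (1/2N) * sum over the 2N merged records of (p_hat_i - 1/2)^2
    return np.mean((p_hat - 0.5) ** 2)
```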

      Since the goal of SDC is to modify data so that sufficient protection is provided at minimum information loss, a good anonymization method is one that comes close to optimizing the trade-off between disclosure risk and information loss. Several approaches have been proposed to handle this trade-off. Here we discuss SDC scores and R-U maps.

       SDC scores

      An SDC score is a formula that combines the effects of information loss and disclosure risk in a single figure. Having adopted an SDC score as a good trade-off measure, the goal is to optimize the score value. Following this idea, [25] proposed a score for method performance rating based on the average of information loss and disclosure risk measures. For each method M and parameterization P, the following score is computed:

      Score(X, Y) = (IL(X, Y) + DR(X, Y)) / 2,

      where IL is an information loss measure, DR is a disclosure risk measure, and Y is the protected data set obtained after applying method M with parameterization P to an original data set X. In [25], IL and DR were computed using a weighted combination of several information loss and disclosure risk measures. With the resulting score, a ranking of masking methods (and their parameterizations) was obtained. Using a score permits regarding the selection of a masking method and its parameters as an optimization problem: a masking method can be applied to the original data file and then a post-masking optimization procedure can be applied to decrease the score obtained (that is, to reduce information loss and disclosure risk). On the negative side, no specific score weighting can do justice to all methods. Thus, when ranking methods, the values of all measures of information loss and disclosure risk should be supplied along with the overall score.
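      As a small illustration of how such a score can rank methods, the sketch below averages IL and DR values assumed to be on comparable (e.g., percentage) scales; the method names and the numbers are hypothetical.

```python
def sdc_score(il, dr):
    """Score of [25]: average of information loss and disclosure risk."""
    return (il + dr) / 2

# Hypothetical (IL, DR) results for three parameterizations of some method M:
results = {"M(p=1)": (10.2, 40.0), "M(p=5)": (25.7, 12.3), "M(p=10)": (48.1, 3.9)}
ranking = sorted(results, key=lambda k: sdc_score(*results[k]))
print(ranking)   # best (lowest-score) parameterization first
```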

       R-U maps

      A …