Database Anonymization. David Sánchez
Чтение книги онлайн.

Читать онлайн книгу Database Anonymization - David Sánchez страница 9

СКАЧАТЬ

      • Masking by correlated noise addition. Noise addition alone always modifies the variance of the original attributes. Thus, if we want to preserve the correlation coefficients of the original data, the covariances must be modified. This is what masking by correlated noise does. By taking the covariance matrix of the noise to be proportional to the covariance matrix of the original data we have:

      • Masking by noise addition and linear transformation. In [48], a method is proposed that ensures by additional transformations that the sample covariance matrix of the masked attributes is an unbiased estimator for the covariance matrix of the original attributes.

      • Masking by noise addition and nonlinear transformation. Combining simple additive noise and nonlinear transformation has also been proposed, in such a way that application to discrete attributes is possible and univariate distributions are preserved. Unfortunately, the application of this method is very time-consuming and requires expert knowledge on the data set and the algorithm. See [44] for more details.

      Noise addition methods with normal distributions are naturally meant for continuous data, even though some adaptations to categorical data have been also proposed [76]. Moreover, the introduction of the differential privacy model for disclosure control has motivated the use of other noise distributions. The focus here is on the preservation of the privacy guarantees of the model rather than the statistical properties of the data. The addition of uncorrelated Laplace distributed noise is the most common approach to attain differential privacy [29]. For the case of discrete data, the geometric distribution [33] is a better alternative to the Laplace distributed noise. It has also been shown that the Laplace distribution is not the optimal noise in attaining differential privacy for continuous data [92].

       Data/rank swapping

      Data swapping was originally presented as an SDC method for databases containing only categorical attributes. The basic idea behind the method is to transform a database by exchanging values of confidential attributes among individual records. Records are exchanged in such a way that low-order frequency counts or marginals are maintained.

      In spite of the original procedure not being very used in practice, its basic idea had a clear influence in subsequent methods. A variant of data swapping for microdata is rank swapping, which will be described next in some detail. Although originally described only for ordinal attributes [36], rank swapping can also be used for any numerical attribute. See Algorithm 1. First, values of an attribute A are ranked in ascending order, then each ranked value of A is swapped with another ranked value randomly chosen within a restricted range (e.g., the rank of two swapped values cannot differ by more than p% of the total number of records, where p is an input parameter).

      This algorithm is independently used on each original attribute in the original data set. It is reasonable to expect that multivariate statistics computed from data swapped with this algorithm will be less distorted than those computed after an unconstrained swap.

       Microaggregation

      Microaggregation is a family of SDC techniques for continuous microdata. The rationale behind microaggregation is that confidentiality rules in use allow publication of microdata sets if records correspond to groups of k or more individuals, where no individual dominates (i.e., contributes too much to) the group and k is a threshold value. Strict application of such confidentiality rules leads to replacing individual values with values computed on small aggregates (microaggregates) prior to publication. This is the basic principle of microaggregation. To obtain microaggregates in a microdata set with n records, these are combined to form g groups of size at least k. For each attribute, the average value over each group is computed and is used to replace each of the original averaged values. Groups are formed using a criterion of maximal similarity. Once the procedure has been completed, the resulting (modified) records can be published. The optimal k-partition (from the information loss point of view) is defined to be the one that maximizes within-group homogeneity; the higher the within-group homogeneity, the lower the information loss, since microaggregation replaces values in a group by the group centroid. The sum of squares criterion is common to measure homogeneity in clustering. The within-groups sum of squares SSE is defined as:

      The between-groups sum of squares SSA is

      The total sum of squares is SST = SSA C SSE or explicitly

      The lower the SSE, the higher the within group homogeneity. Thus, in terms of sums of squares, the optimal k-partition is the one that minimizes SSE (or equivalently, maximizes SSA).

      Given a microdata set consisting of p attributes, these can be microaggregated together or partitioned into several groups of attributes and then microaggregated. Also, the way to form groups may vary. Several taxonomies are possible to classify the microaggregation algorithms in the literature: (i) fixed group size [17, 26, 45] vs. variable group size [21, 49, 57]; (ii) exact optimal (only for the univariate case, [41]) vs. heuristic microaggregation (the rest of the microaggregation literature); (iii) categorical [26, 57] vs. continuous (the rest of the references cited in this paragraph). Also, depending on whether they deal with one or several attributes at a time, microaggregation methods can be classified into univariate and multivariate.

      • Univariate methods deal with multi-attribute data sets by microaggregating one attribute at a time. Input records are sorted by the first attribute, then groups of successive k values of the first attribute are created and all values within that group are replaced by the group representative (e.g., centroid). The same procedure is repeated for the rest of the attributes. Notice that all attribute values of each record are moved together when sorting records by a particular attribute; hence, the relation between the attribute values within each record is preserved. This approach is known as individual ranking [16, 17]. Individual ranking just reduces the variability of attributes, thereby providing some anonymization. In [25] it was shown that individual ranking causes low information loss and, thus, its output better preserves analytical utility. However, the disclosure risk in the anonymized output remains unacceptably high [22].

      • To deal with several attributes at a time, the trivial option is to map multi-attribute data sets to univariate data by projecting the former onto a single axis (e.g., using the sum of z-scores or the first principal component, see [16]) and then use univariate microaggregation on the univariate data. Another option avoiding the information loss due to single-axis projection is to use multivariate microaggregation able to deal with unprojected multi-attribute data [21]. If we define optimal microaggregation as finding a partition in groups of size at least k such that within-groups homogeneity is maximum, it turns out that, while optimal univariate microaggregation can be solved in polynomial time [41], unfortunately optimal multivariate microaggregation is NP-hard [70]. This justifies the use of heuristic methods for multivariate microaggregation, such as the MDAV (Maximum Distance to Average Vector, [20, 26]). In any case, multivariate microaggregation leads to higher information loss than individual ranking [25].

      To illustrate these approaches, we next give the details of the MDAV heuristic СКАЧАТЬ