Title: Database Anonymization
Author: David Sánchez
Publisher: Ingram
Genre: Computers: Other
Series: Synthesis Lectures on Information Security, Privacy, and Trust
ISBN: 9781681731988
9.2 Differentially Private Data Sets by Insensitive Microaggregation
9.3 General Insensitive Microaggregation
9.4 Differential Privacy with Categorical Attributes
9.5 A Semantic Distance for Differential Privacy
9.6 Integrating Heterogeneous Attribute Types
9.7 Summary
10 Differential Privacy by Individual Ranking Microaggregation
10.1 Limitations of Multivariate Microaggregation
10.2 Sensitivity Reduction Via Individual Ranking
10.3 Choosing the Microaggregation Parameter k
10.4 Summary
11 Conclusions and Research Directions
11.1 Summary and Conclusions
11.2 Research Directions
Preface
If jet airplanes ushered in the first dramatic reduction of our world’s perceived size, the next shrinking came in the mid-1990s, when the Internet became widespread and the Information Age started to become a reality. We now live in a global village, and some (often quite powerful) voices proclaim that maintaining one’s privacy is as hopeless as it used to be in conventional small villages. Should this be true, human ingenuity would have created its own nightmare.
Whereas security is essential for organizations to survive, individuals, and sometimes even companies, also need some privacy to develop comfortably and lead a free life. This is why individual privacy is mentioned in the Universal Declaration of Human Rights (1948) and data privacy is protected by law in most Western countries. Indeed, without privacy, other fundamental rights, like freedom of speech and democracy, are impaired. The outstanding challenge is to create technology that implements those legal guarantees in a way compatible with functionality and security.
This book is devoted to privacy preservation in data releases. Indeed, in our era of big data, harnessing the enormous wealth of information available is essential to increasing the progress and well-being of humankind. The challenge is how to release data that are useful for administrations and companies to make accurate decisions without disclosing sensitive information on specific identifiable individuals.
This conflict between utility and privacy has motivated research by several communities since the 1970s, both in official statistics and computer science. Specifically, computer scientists contributed the important notion of the privacy model in the late 1990s, with k-anonymity being the first practical privacy model. The idea of a privacy model is to state ex ante privacy guarantees that can be attained for a particular data set using one (or several) anonymization methods.
In addition to k-anonymity, we survey here its extensions l-diversity and t-closeness, as well as the alternative paradigm of differential privacy. Further, we draw on our recent research to report connections and synergies between all these privacy models: in fact, the k-anonymity-like models and differential privacy turn out to be more related than previously thought. We also show how microaggregation, a well-known family of anonymization methods that we have developed to a large extent since the late 1990s, can be used to create anonymization methods that satisfy most of the surveyed privacy models while improving the utility of the resulting protected data.
We sincerely hope that the reader, whether academic or practitioner, will benefit from this piece of work. On our side, we have enjoyed writing it and also conducting the original research described in some of the chapters.
Josep Domingo-Ferrer, David Sánchez, and Jordi Soria-Comas
January 2016
Acknowledgments
We thank Professor Elisa Bertino for encouraging us to write this Synthesis Lecture. This work was partly supported by the European Commission (through project H2020 “CLARUS”), by the Spanish Government (through projects “ICWT” TIN2012-32757 and “SmartGlacis” TIN2014-57364-C2-1-R), by the Government of Catalonia (under grant 2014 SGR 537), and by the Templeton World Charity Foundation (under project “CO-UTILITY”). Josep Domingo-Ferrer is partially supported as an ICREA-Acadèmia researcher by the Government of Catalonia. The authors are with the UNESCO Chair in Data Privacy, but the opinions expressed in this work are the authors’ own and do not necessarily reflect the views of UNESCO or any of the funders.
Josep Domingo-Ferrer, David Sánchez, and Jordi Soria-Comas
January 2016
CHAPTER 1
Introduction
The current social and economic context increasingly demands open data to improve planning, scientific research, market analysis, etc. In particular, the public sector is pushed to release as much information as possible for the sake of transparency. Organizations releasing data include national statistical institutes (whose core mission is to publish statistical information), healthcare authorities (which occasionally release epidemiologic information), and even private organizations (which sometimes publish consumer surveys). When published data refer to individual respondents, care must be taken so that their privacy is not violated: it should be de facto impossible to relate the published data to specific individuals. Indeed, supplying data to national statistical institutes is compulsory in most countries but, in return, these institutes commit to preserving the privacy of the respondents. Hence, rather than publishing accurate information for each individual, the aim should be to provide useful statistical information, that is, to preserve in the released data as much as possible of the statistical properties of the original data.
Disclosure risk limitation has a long tradition in official statistics, where privacy-preserving databases on individuals are called statistical databases. Inference control in statistical databases, also known as Statistical Disclosure Control (SDC), Statistical Disclosure Limitation (SDL), database anonymization, or database sanitization, is a discipline that seeks to protect data so that they can be published without revealing confidential information that can be linked to specific individuals among those to whom the data correspond.
Disclosure limitation has also been a topic of interest in the computer science research community, which refers to it as Privacy Preserving Data Publishing (PPDP) and Privacy Preserving Data Mining (PPDM). The latter focuses on protecting the privacy of the results of data mining tasks, whereas the former focuses on the publication of data of individuals.