Over the last twenty years, there has been a tremendous growth in the amount of private data collected about individuals. With the rapid growth in database, networking, and computing technologies, such data can be integrated and analyzed digitally. On the one hand, this has led to the development of data mining tools that aim to infer useful trends from this data. But, on the other hand, easy access to personal data poses a threat to individual privacy. In this thesis, we provide models and algorithms for protecting the privacy of individuals in such large data sets while still allowing users to mine useful trends and statistics.
We focus on the problem of statistical disclosure control - revealing aggregate statistics about a population while preserving the privacy of individuals. A statistical database can be viewed as a table containing personal records, where the rows correspond to individuals and the columns correspond to different attributes. For example, a medical database may contain attributes such as name, social security number, address, age, gender, ethnicity, and medical history for each patient. We would like the medical researchers to have some form of access to this database so as to learn trends such as correlation between age and heart disease, while maintaining individual privacy. There are broadly two frameworks for protecting privacy in statistical databases. In the interactive framework, the user (researcher) queries the database through a privacy mechanism, which may deny the query or alter the answer to the query in order to ensure privacy. In the non-interactive framework, the original database is first sanitized so as to preserve privacy and then the modified version is released. We study methods under both these frameworks as each method is useful in different contexts.
The first part of the thesis focuses on the interactive framework and provides models and algorithms for two methods used in this framework. We first consider the online query auditing problem: given a sequence of queries that have already been posed about the data, their corresponding answers and given a new query, deny the answer if privacy can be breached or give the true answer otherwise. We uncover the fundamental problem that query denials leak information. As this problem was overlooked in previous work, some of the previously suggested auditors can be used by an attacker to compromise the privacy of a large fraction of the individuals in the data. To overcome this problem, we introduce a new model called simulatable auditing where query denials provably do not leak information. We also describe a probabilistic notion of (partial) compromise, in order to overcome the known limitations of the existing privacy definition. We then present simulatable auditing algorithms under both these definitions. The second problem we consider is output perturbation, in which the database administrator computes exact answer to the query and then outputs a perturbed answer (by adding random noise) as the response to the query. Inspired by the desire to enable individuals to retain control over their information, we provide a fault-tolerant distributed implementation of output perturbation schemes, thereby eliminating the need for a trusted database administrator.
The second part of the thesis focuses on the non-interactive framework and considers two anonymization methods for publishing data for analysis from a table containing personal records. We present approximation algorithms for anonymizing databases using the k-Anonymity model. Then we propose a new method for anonymizing data records, where the data records are clustered and then cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a pre-specified number of data records. We provide approximation algorithms to come up with such a clustering.