Differential Privacy in Metric Spaces: Numerical, Categorical and Functional Data Under the One Roof
We study Differential Privacy in the abstract setting of Probability on
metric spaces. Numerical, categorical and functional data can be handled in a
uniform manner in this setting. We demonstrate how mechanisms based on data
sanitisation and those that rely on adding noise to query responses fit within
this framework. We prove that once the sanitisation is differentially private,
then so is the query response for any query. We show how to construct
sanitisations for high-dimensional databases using simple 1-dimensional
mechanisms. We also provide lower bounds on the expected error for
differentially private sanitisations in the general metric space setting.
Finally, we consider the question of sufficient sets for differential privacy
and show that for relaxed differential privacy, any algebra generating the
Borel σ-algebra is a sufficient set for relaxed differential privacy.
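As a rough illustration of the "adding noise to query responses" family of mechanisms mentioned in the abstract above, the sketch below applies the standard Laplace mechanism to a counting query. It is not taken from the paper: the function names `laplace_noise` and `private_count`, and the choice of a counting query with sensitivity 1, are assumptions made for this example only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records satisfying `predicate`.

    Assumption: adding or removing one record changes the count by at
    most 1 (sensitivity 1), so Laplace noise of scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 52, 67, 29, 71]
    print(private_count(ages, lambda a: a >= 40, epsilon=0.5))
```

A smaller `epsilon` means a larger noise scale and therefore stronger privacy at the cost of a less accurate answer.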
Characterizing the Sample Complexity of Private Learners
In 2008, Kasiviswanathan et al. defined private learning as a combination of
PAC learning and differential privacy. Informally, a private learner is applied
to a collection of labeled individual information and outputs a hypothesis
while preserving the privacy of each individual. Kasiviswanathan et al. gave a
generic construction of private learners for (finite) concept classes, with
sample complexity logarithmic in the size of the concept class. This sample
complexity is higher than what is needed for non-private learners, hence
leaving open the possibility that the sample complexity of private learning may
be sometimes significantly higher than that of non-private learning.
We give a combinatorial characterization of the sample size sufficient and
necessary to privately learn a class of concepts. This characterization is
analogous to the well known characterization of the sample complexity of
non-private learning in terms of the VC dimension of the concept class. We
introduce the notion of probabilistic representation of a concept class, and
our new complexity measure RepDim corresponds to the size of the smallest
probabilistic representation of the concept class.
We show that any private learning algorithm for a concept class C with sample
complexity m implies RepDim(C)=O(m), and that there exists a private learning
algorithm with sample complexity m=O(RepDim(C)). We further demonstrate that a
similar characterization holds for the database size needed for privately
computing a large class of optimization problems and also for the well studied
problem of private data release.
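To make the idea of a generic private learner for a finite concept class more concrete, here is a minimal sketch that picks a hypothesis with the exponential mechanism, scoring each concept by its number of training errors. This is an illustration under stated assumptions, not the construction from the paper: the name `private_learner`, the representation of concepts as Python callables, and the threshold class in the usage example are all hypothetical.

```python
import math
import random


def private_learner(samples, concepts, epsilon, rng=None):
    """Return the index of a concept chosen by the exponential mechanism.

    samples  : list of (x, label) pairs
    concepts : finite list of callables mapping x to a boolean label
    Assumption: the score of a concept is minus its number of training
    errors; changing one sample changes each score by at most 1.
    """
    rng = rng or random.Random()
    scores = [-sum(1 for x, y in samples if c(x) != y) for c in concepts]
    best = max(scores)
    # Subtract the best score before exponentiating for numerical stability;
    # this does not change the sampling distribution.
    weights = [math.exp(epsilon * (s - best) / 2.0) for s in scores]
    return rng.choices(range(len(concepts)), weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical concept class: thresholds on the integers 0..10.
    concepts = [lambda x, t=t: x >= t for t in range(11)]
    samples = [(x, x >= 5) for x in range(10)]
    print(private_learner(samples, concepts, epsilon=1.0))
```

Because the output distribution is spread over the whole class, the number of samples needed for the low-error concepts to dominate grows with the logarithm of the class size, which matches the logarithmic sample complexity quoted in the abstract.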
Enabling Efficient Fuzzy Keyword Search over Encrypted Data in Cloud Computing
As Cloud Computing becomes prevalent, more and more sensitive information is being centralized into the cloud. For the
protection of data privacy, sensitive data usually have to be encrypted before outsourcing, which makes effective data
utilization a very challenging task. Although traditional searchable encryption schemes allow a user to securely search
over encrypted data through keywords and selectively retrieve files of interest, these techniques support only
exact keyword search. That is, there is no tolerance of minor typos and format inconsistencies which, on the
other hand, are typical user searching behavior and happen very frequently. This significant drawback makes existing
techniques unsuitable in Cloud Computing as it greatly affects system usability, rendering user searching experiences
very frustrating and system efficacy very low. In this paper, for the first time we formalize and solve the problem of
effective fuzzy keyword search over encrypted cloud data while maintaining keyword privacy. Fuzzy keyword search
greatly enhances system usability by returning the matching files when users' search inputs exactly match the predefined keywords, or the closest possible matching files based on keyword similarity semantics when exact match fails. In our solution, we exploit edit distance to quantify keyword similarity and develop two advanced
techniques for constructing fuzzy keyword sets, which achieve optimized storage and representation overheads. We further propose a brand new symbol-based trie-traverse searching scheme, where a multi-way tree structure is built up using symbols transformed from the resulting fuzzy keyword sets. Through rigorous security analysis, we show that our proposed solution is secure and privacy-preserving, while correctly realizing the goal of fuzzy keyword search. Extensive
experimental results demonstrate the efficiency of the proposed solution.
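The abstract above uses edit distance as its measure of keyword similarity. The sketch below shows that measure on plaintext keywords only; it does not reproduce the paper's encrypted fuzzy keyword sets or its trie-based search, and the names `edit_distance` and `fuzzy_match` are assumptions introduced for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_match(query, keywords, d=1):
    """Return the keywords within edit distance d of the query."""
    return [w for w in keywords if edit_distance(query, w) <= d]


if __name__ == "__main__":
    print(fuzzy_match("cassle", ["castle", "cattle", "candle"], d=1))
```

In a searchable-encryption setting the server never sees plaintext keywords, so this comparison would have to be replaced by matching over precomputed, encrypted variants of each keyword.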
Topics in Massive Data Summarization.
We consider three problems in this thesis.
First, we want to construct a nearly workload-optimal histogram. Given B, we want to find the near optimal B bucket histogram under associated workload w within 1 + epsilon error tolerance. In the cash register model where data is streamed as a series of updates, we can build a histogram using polylogarithmic space, polylogarithmic time to process each item, and polylogarithmic post-processing time to build the histogram. All these results need the workload to be explicitly stored since we show that if the workload is summarized in small space lossily, algorithmic results such as above do not exist.
Then, we consider the problem of private computation of approximate Heavy Hitters. Alice and Bob each hold a vector and, in the vector sum, they want to find the B largest values along with their indices. We show how to solve the problem privately with polylogarithmic communication, polynomial work and constantly many rounds, in the sense that nothing is learned by Alice and Bob beyond what is implied by their input, the ideal top-B output, and goodness of approximation (equivalently, the Euclidean norm of the vector sum). We give lower bounds showing that the Euclidean norm must be leaked by any efficient algorithm.
In the third problem, we want to build a near optimal histogram on probabilistic data streams. Given B, we want to find the near optimal B bucket histogram on probabilistic data streams under both L1 measurement and L2 measurement. We give deterministic algorithms without sampling. We can build histograms using polylogarithmic space, polylogarithmic time to process each item, and polylogarithmic post-processing time to build the histogram. The result we give under L2 measurement is within 1 + epsilon error tolerance, and the result under L1 measurement is heuristic. We also suggest a direction toward guarantees for the heuristic.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/60841/1/xuanzh_1.pd
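For readers unfamiliar with the objective behind a "B bucket histogram", the sketch below computes an exactly optimal one offline by dynamic programming, minimizing the sum of squared errors (the L2 measure) when each bucket is represented by its mean. It is only a baseline illustration: it is not the streaming, workload-aware, or probabilistic-data algorithm from the thesis, it uses linear rather than polylogarithmic space, and the name `v_optimal_histogram` is an assumption.

```python
def v_optimal_histogram(values, num_buckets):
    """Return (total squared error, list of (start, end) bucket ranges).

    Offline O(n^2 * B) dynamic program; assumes 1 <= num_buckets <= len(values).
    Bucket ranges are half-open index intervals into `values`.
    """
    n = len(values)
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def sse(i, j):
        # Squared error of values[i:j] around their mean.
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    split = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(1, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j] = c
                    split[b][j] = i

    # Walk the split table backwards to recover the bucket boundaries.
    bounds = []
    j = n
    for b in range(num_buckets, 0, -1):
        i = split[b][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return cost[num_buckets][n], bounds


if __name__ == "__main__":
    print(v_optimal_histogram([1, 1, 1, 5, 5, 5, 9, 9], 3))
```

A streaming algorithm of the kind described in the abstract cannot store the full prefix-sum arrays, which is why it settles for a 1 + epsilon approximation of this optimum.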
Private Approximation of Search Problems
Many approximation algorithms have been presented in the last decades for hard search problems. The focus of this paper is on cryptographic applications, where it is desired to design algorithms which do not leak unnecessary information. Specifically, we are interested in private approximation algorithms – efficient algorithms whose output does not leak information not implied by the optimal solutions to the search problems. Privacy requirements add constraints on the approximation algorithms; in particular, known approximation algorithms usually leak a lot of information. For functions, [Feigenbaum et al., ICALP 2001] presented a natural requirement that a private algorithm should not leak information not implied by the original function. Generalizing this requirement to search problems is not straightforward, as an input may have many different outputs. We present a new definition that captures a minimal privacy requirement from such algorithms – applied to an input instance, it should not leak any information that is not implied by its collection of exact solutions. Although our privacy requirement seems minimal, we show that for well studied problems, such as vertex cover and maximum exact 3SAT, private approximation algorithms are unlikely to exist even for poor approximation ratios. Similar to [Halevi et al., STOC 2001], we define a relaxed notion of approximation algorithms that leak (little) information, and demonstrate the applicability of this notion by showing near optimal approximation algorithms for maximum exact 3SAT which leak little information.
Private approximation of clustering and vertex cover
Private approximation of search problems deals with finding approximate solutions to search problems while disclosing as little information as possible. The focus of this work is on private approximation of the vertex cover problem and two well studied clustering problems – k-center and k-median. Vertex cover was considered in [Beimel, Carmi, Nissim, and Weinreb, STOC, 2006] and we improve their infeasibility results. Clustering algorithms are frequently applied to sensitive data, and hence are of interest in the contexts of secure computation and private approximation. We show that these problems do not admit private approximations, or even approximation algorithms that leak a significant number of bits. For the vertex cover problem we show a tight infeasibility result: every algorithm that ρ(n)-approximates vertex cover must leak Ω(n/ρ(n)) bits (where n is the number of vertices in the graph). For the clustering problems we prove that even approximation algorithms with a poor approximation ratio must leak Ω(n) bits (where n is the number of points in the instance). For these results we develop new proof techniques, which are simpler and more intuitive than those in Beimel et al., and yet allow stronger infeasibility results. Our proofs rely on the hardness of the promise problem where a unique optimal solution exists [Valiant and Vazirani, Theoretical Computer Science, 1986], on the hardness of approximating witnesses for NP-hard problems ([Kumar and Sivakumar