    OptimShare: A Unified Framework for Privacy Preserving Data Sharing -- Towards the Practical Utility of Data with Privacy

    Tabular data sharing is a common method of data exchange. However, sharing sensitive information without adequate privacy protection can compromise individual privacy, so privacy-preserving data sharing is crucial. Differential privacy (DP) is regarded as the gold standard in data privacy. Despite this, current DP methods tend to generate privacy-preserving tabular datasets of limited practical utility, owing to heavy perturbation and disregard for the tables' utility dynamics. In addition, there has been little research on selective attribute release, particularly in the context of controlled, partially perturbed data sharing. This has significant implications for real-world scenarios such as cross-agency data sharing. We introduce OptimShare: a utility-focused, multi-criteria solution designed to perturb input datasets selectively, optimized for specific real-world applications. OptimShare combines the principles of differential privacy, fuzzy logic, and probability theory to establish an integrated tool for privacy-preserving data sharing. Empirical assessments confirm that OptimShare strikes a balance between better data utility and robust privacy, effectively serving various real-world problem scenarios.

    MVG Mechanism: Differential Privacy under Matrix-Valued Query

    Differential privacy mechanism design has traditionally been tailored for a scalar-valued query function. Although many mechanisms such as the Laplace and Gaussian mechanisms can be extended to a matrix-valued query function by adding i.i.d. noise to each element of the matrix, this method is often suboptimal as it forfeits an opportunity to exploit the structural characteristics typically associated with matrix analysis. To address this challenge, we propose a novel differential privacy mechanism called the Matrix-Variate Gaussian (MVG) mechanism, which adds a matrix-valued noise drawn from a matrix-variate Gaussian distribution, and we rigorously prove that the MVG mechanism preserves $(\epsilon,\delta)$-differential privacy. Furthermore, we introduce the concept of directional noise made possible by the design of the MVG mechanism. Directional noise allows the impact of the noise on the utility of the matrix-valued query function to be moderated. Finally, we experimentally demonstrate the performance of our mechanism using three matrix-valued queries on three privacy-sensitive datasets. We find that the MVG mechanism notably outperforms four previous state-of-the-art approaches, and provides comparable utility to the non-private baseline. Comment: Appeared in CCS'1
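
    As a rough illustration of the noise described above, the following sketch draws a sample from a zero-mean matrix-variate Gaussian and adds it to a matrix-valued query answer. The function names, the example covariances, and the use of NumPy are assumptions made for illustration; in particular, the covariances here are not calibrated to any $(\epsilon,\delta)$ budget, since the abstract does not state the calibration conditions.

```python
import numpy as np


def sample_matrix_variate_gaussian(rng, row_cov, col_cov):
    """Draw Z ~ MN(0, row_cov, col_cov).

    Uses the standard construction Z = A @ N @ B.T, where
    row_cov = A @ A.T, col_cov = B @ B.T, and N has i.i.d. N(0, 1) entries.
    """
    a = np.linalg.cholesky(row_cov)
    b = np.linalg.cholesky(col_cov)
    n = rng.standard_normal((row_cov.shape[0], col_cov.shape[0]))
    return a @ n @ b.T


def perturb_matrix_query(query_output, row_cov, col_cov, rng):
    """Add matrix-variate Gaussian noise to a matrix-valued query answer.

    NOTE: illustrative only; the covariances are NOT calibrated for privacy.
    """
    return query_output + sample_matrix_variate_gaussian(rng, row_cov, col_cov)


rng = np.random.default_rng(0)
# Hypothetical "directional" choice: more noise on row 0, less on row 1.
row_cov = np.diag([4.0, 0.25])
col_cov = np.eye(3)
true_answer = np.arange(6, dtype=float).reshape(2, 3)
noisy_answer = perturb_matrix_query(true_answer, row_cov, col_cov, rng)
```

    With i.i.d. noise both covariances would be scaled identity matrices; a non-uniform covariance is one simple way to picture how noise could be steered away from the directions that matter most for utility.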

    MobilityMirror: Bias-Adjusted Transportation Datasets

    We describe customized synthetic datasets for publishing mobility data. Private companies are providing new transportation modalities, and their data is of high value for integrative transportation research, policy enforcement, and public accountability. However, these companies are disincentivized from sharing data not only to protect the privacy of individuals (drivers and/or passengers), but also to protect their own competitive advantage. Moreover, demographic biases arising from how the services are delivered may be amplified if released data is used in other contexts. We describe a model and algorithm for releasing origin-destination histograms that removes selected biases in the data using causality-based methods. We compute the origin-destination histogram of the original dataset, then adjust the counts to remove undesirable causal relationships that can lead to discrimination or violate contractual obligations with data owners. We evaluate the utility of the algorithm on real data from a dockless bike share program in Seattle and taxi data in New York, and show that these adjusted transportation datasets can retain utility while removing bias in the underlying data. Comment: Presented at BIDU 2018 workshop and published in Springer Communications in Computer and Information Science vol 92

    Differentially Private High-Dimensional Data Publication in Internet of Things

    The Internet of Things and related computing paradigms, such as cloud computing and fog computing, provide solutions for various applications and services with massive and high-dimensional data, while posing threats to personal privacy. Differential privacy is a promising privacy-preserving definition for various applications and is enforced by injecting random noise into each query result such that an adversary with arbitrary background knowledge cannot infer sensitive input from the noisy results. Nevertheless, existing differentially private mechanisms have poor utility and high computational complexity on high-dimensional data because the necessary noise in queries is proportional to the size of the data domain, which is exponential in the dimensionality. To address these issues, we develop a compressed sensing mechanism (CSM) that enforces differential privacy on the basis of the compressed sensing framework while providing accurate results to linear queries. We derive the utility guarantee of CSM theoretically. An extensive experimental evaluation on real-world datasets over multiple fields demonstrates that our proposed mechanism consistently outperforms several state-of-the-art mechanisms under differential privacy.
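
    The noise injection mentioned above can be pictured with the textbook Laplace mechanism, sketched below. This is the generic baseline rather than the CSM itself, whose construction the abstract does not detail; the function name and the example values are assumptions.

```python
import numpy as np


def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return true_answer plus Laplace noise of scale sensitivity / epsilon.

    For a query with the given L1 sensitivity this is the standard
    epsilon-differentially-private Laplace mechanism.
    """
    answer = np.asarray(true_answer, dtype=float)
    scale = sensitivity / epsilon
    # Independent noise for every entry of the answer.
    return answer + rng.laplace(loc=0.0, scale=scale, size=answer.shape)


rng = np.random.default_rng(0)
# Hypothetical counting query: adding or removing one record changes it by at most 1.
noisy_count = laplace_mechanism(128, sensitivity=1.0, epsilon=0.5, rng=rng)
```

    A smaller `epsilon` yields a larger noise scale, which is the privacy-utility tension that motivates mechanisms designed for high-dimensional data.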

    Heterogeneous differential privacy for vertically partitioned databases

    © 2019 John Wiley & Sons, Ltd. Existing privacy-preserving approaches are generally designed to provide a privacy guarantee for individual data in a database, which reduces the utility of the database for data analysis. In this paper, we propose a novel differential privacy mechanism to preserve the heterogeneous privacy of a vertically partitioned database based on attributes. We first present the concept of a privacy label, which characterizes the privacy information of the database and is instantiated by the classification. Then, we use an information-based method to systematically explore the dependencies between all attributes and the privacy label. We finally assign privacy weights to every attribute and design a heterogeneous mechanism according to the basic Laplace mechanism. Evaluations using real datasets demonstrate that the proposed mechanism achieves a balance between privacy and utility.
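
    One way to picture a weight-driven heterogeneous Laplace mechanism is sketched below. The allocation rule (each attribute's share of the budget is inversely proportional to its privacy weight) is an assumption chosen for illustration, not the rule from the paper, and the attribute names and values are hypothetical.

```python
import numpy as np


def allocate_epsilon(privacy_weights, total_epsilon):
    """Split total_epsilon across attributes.

    ASSUMPTION: attributes with a higher privacy weight receive a smaller
    share (hence more noise); shares are proportional to 1 / weight.
    """
    inverse = {name: 1.0 / weight for name, weight in privacy_weights.items()}
    norm = sum(inverse.values())
    return {name: total_epsilon * value / norm for name, value in inverse.items()}


def perturb_per_attribute(values, epsilons, rng, sensitivity=1.0):
    """Apply the basic Laplace mechanism to each attribute with its own epsilon."""
    return {
        name: values[name] + rng.laplace(0.0, sensitivity / epsilons[name])
        for name in values
    }


rng = np.random.default_rng(0)
weights = {"diagnosis": 3.0, "zip_code": 1.0}  # hypothetical privacy weights
epsilons = allocate_epsilon(weights, total_epsilon=1.0)
counts = {"diagnosis": 40.0, "zip_code": 310.0}  # hypothetical per-attribute counts
noisy_counts = perturb_per_attribute(counts, epsilons, rng)
```

    Because the per-attribute budgets sum to `total_epsilon`, sequential composition bounds the overall privacy loss of releasing all perturbed values by that total.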

    Privacy-Preserving Data Collection and Sharing in Modern Mobile Internet Systems

    With the ubiquity and widespread use of mobile devices such as laptops, smartphones, smartwatches, and IoT devices, large volumes of user data are generated and recorded. While there is great value in collecting, analyzing and sharing this data for improving products and services, data privacy poses a major concern. This dissertation research addresses the problem of privacy-preserving data collection and sharing in the context of both mobile trajectory data and mobile Internet access data. The first contribution of this dissertation research is the design and development of a system for utility-aware synthesis of differentially private and attack-resilient location traces, called AdaTrace. Given a set of real location traces, AdaTrace executes a four-phase process consisting of feature extraction, synopsis construction, noise injection, and generation of synthetic location traces. Compared to representative prior approaches, the location traces generated by AdaTrace offer up to 3-fold improvement in utility, measured using a variety of utility metrics and datasets, while preserving both differential privacy and attack resilience. The second contribution of this dissertation research is the design and development of locally private protocols for privacy-sensitive collection of mobile and Web user data. Motivated by the excessive utility loss of existing Local Differential Privacy (LDP) protocols under small user populations, this dissertation introduces the notion of Condensed Local Differential Privacy (CLDP) and a suite of protocols satisfying CLDP to enable the collection of various types of user data, ranging from ordinal data types in finite metric spaces (malware infection statistics), to non-ordinal items (OS versions and transaction categories), and to sequences of ordinal or non-ordinal items. 
Using cybersecurity data and case studies from Symantec, a major cybersecurity vendor, we show that the proposed CLDP protocols are practical for key tasks including malware outbreak detection, OS vulnerability analysis, and inspecting suspicious activities on infected machines. The third contribution of this dissertation research is the development of a framework and a prototype system for evaluating privacy-utility tradeoffs of different LDP protocols, called LDPLens. LDPLens introduces metrics to evaluate protocol tradeoffs based on factors such as the utility metric, the data collection scenario, and the user-specified adversary metric. We develop a common Bayesian adversary model to analyze LDP protocols, and we formally and experimentally analyze the Adversarial Success Rate (ASR) under each protocol. Motivated by the finding that numerous factors impact the ASR and utility behaviors of LDP protocols, we develop LDPLens to provide effective recommendations for finding the most suitable protocol in a given setting. Our three case studies with real-world datasets demonstrate that using the protocol recommended by LDPLens can offer a substantial reduction in utility loss or in ASR, compared to using a randomly chosen protocol. Ph.D.
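
    To make the idea of an adversarial success rate concrete, the sketch below uses generalized randomized response (GRR), a standard LDP protocol that the abstract does not name, together with a Bayesian adversary who holds a uniform prior. The function names and the uniform-prior assumption are illustrative; the adversary model in LDPLens may differ.

```python
import math
import random


def grr_perturb(value, domain, epsilon, rng):
    """Generalized randomized response over a finite domain (len(domain) >= 2).

    Reports the true value with probability e^eps / (e^eps + k - 1),
    otherwise a uniformly chosen different value.
    """
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_keep:
        return value
    return rng.choice([v for v in domain if v != value])


def grr_asr_uniform_prior(k, epsilon):
    """ASR of a Bayesian adversary under a uniform prior (assumption).

    With a uniform prior the posterior is maximized by the reported value
    itself, so the adversary succeeds exactly when the report is truthful.
    """
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


rng = random.Random(0)
os_versions = ["v1", "v2", "v3", "v4"]  # hypothetical domain
report = grr_perturb("v2", os_versions, epsilon=1.0, rng=rng)
asr = grr_asr_uniform_prior(len(os_versions), epsilon=1.0)
```

    Under these assumptions the ASR grows with `epsilon` and shrinks with the domain size, one example of how the setting can change which protocol is preferable.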

    Enhancing Privacy and Fairness in Search Systems

    Following a period of rapid progress in the capabilities of digital systems, society is beginning to realize that systems designed to assist people in various tasks can also harm individuals and society. Mediating access to information and explicitly or implicitly ranking people in increasingly many applications, search systems have a substantial potential to contribute to such unwanted outcomes. Since they collect vast amounts of data about both searchers and search subjects, they have the potential to violate the privacy of both of these groups of users. Moreover, in applications where rankings influence people's economic livelihood outside of the platform, such as sharing economy or hiring support websites, search engines have immense economic power over their users in that they control user exposure in ranked results. This thesis develops new models and methods broadly covering different aspects of privacy and fairness in search systems for both searchers and search subjects. Specifically, it makes the following contributions: (1) We propose a model for computing individually fair rankings where search subjects get exposure proportional to their relevance. The exposure is amortized over time using constrained optimization to overcome searcher attention biases while preserving ranking utility. (2) We propose a model for computing sensitive search exposure where each subject gets to know the sensitive queries that lead to her profile in the top-k search results. The problem of finding exposing queries is technically modeled as reverse nearest neighbor search, followed by a weakly supervised learning-to-rank model ordering the queries by privacy-sensitivity. (3) We propose a model for quantifying privacy risks from textual data in online communities. The method builds on a topic model where each topic is annotated by a crowdsourced sensitivity score, and privacy risks are associated with a user's relevance to sensitive topics. 
We propose relevance measures capturing different dimensions of user interest in a topic and show how they correlate with human risk perceptions. (4) We propose a model for privacy-preserving personalized search where search queries of different users are split and merged into synthetic profiles. The model mediates the privacy-utility trade-off by keeping semantically coherent fragments of search histories within individual profiles, while trying to minimize the similarity of any of the synthetic profiles to the original user profiles. The models are evaluated using information retrieval techniques and user studies over a variety of datasets, ranging from query logs, through social media and community question answering postings, to item listings from sharing economy platforms.