LoPub: High-Dimensional Crowdsourced Data Publication with Local Differential Privacy
High-dimensional crowdsourced data collected from numerous users produces rich knowledge about our society. However, it also brings unprecedented privacy threats to the participants. Local differential privacy (LDP), a variant of differential privacy, has recently been proposed as a state-of-the-art privacy notion. Unfortunately, achieving LDP for high-dimensional crowdsourced data publication raises great challenges in terms of both computational efficiency and data utility. To this end, based on the Expectation Maximization (EM) algorithm and Lasso regression, we first propose efficient multi-dimensional joint distribution estimation algorithms with LDP. Then, we develop a Local differentially private high-dimensional data Publication algorithm, LoPub, by taking advantage of our distribution estimation techniques. In particular, correlations among multiple attributes are identified to reduce the dimensionality of crowdsourced data, thus speeding up the distribution learning process and achieving high data utility. Extensive experiments on real-world datasets demonstrate that our multivariate distribution estimation scheme significantly outperforms existing estimation schemes in terms of both communication overhead and estimation speed. Moreover, LoPub keeps, on average, 80% and 60% accuracy on the released datasets in terms of SVM and random forest classification, respectively.
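As a minimal sketch of the local-privacy building block such protocols rest on (not LoPub's actual EM- and Lasso-based joint-distribution estimator), the following illustrates k-ary generalized randomized response and its standard unbiased frequency estimator; all names and parameter values are illustrative:

```python
import numpy as np

def grr_perturb(value, k, eps, rng):
    # Generalized randomized response: keep the true value with probability p,
    # otherwise report one of the other k-1 values uniformly at random.
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    if rng.random() < p:
        return value
    other = int(rng.integers(0, k - 1))
    return other if other < value else other + 1

def estimate_frequencies(reports, k, eps):
    # Invert the known perturbation probabilities to debias the raw counts.
    n = len(reports)
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    q = (1.0 - p) / (k - 1)
    counts = np.bincount(reports, minlength=k)
    return (counts / n - q) / (p - q)

rng = np.random.default_rng(0)
k, eps = 4, 2.0
true_values = rng.integers(0, k, size=50000)
reports = np.array([grr_perturb(int(v), k, eps, rng) for v in true_values])
est = estimate_frequencies(reports, k, eps)  # close to the uniform 0.25 each
```

LoPub applies estimators of this kind per attribute cluster and then recovers joint distributions with EM and Lasso; that machinery is not reproduced here.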
Differentially Private High-Dimensional Data Publication in Internet of Things
The Internet of Things and related computing paradigms, such as cloud computing and fog computing, provide solutions for various applications and services with massive, high-dimensional data, while posing threats to personal privacy. Differential privacy is a promising privacy-preserving definition for various applications and is enforced by injecting random noise into each query result so that an adversary with arbitrary background knowledge cannot infer the sensitive input from the noisy results. Nevertheless, existing differentially private mechanisms have poor utility and high computational complexity on high-dimensional data because the necessary noise in queries is proportional to the size of the data domain, which is exponential in the dimensionality. To address these issues, we develop a compressed sensing mechanism (CSM) that enforces differential privacy on the basis of the compressed sensing framework while providing accurate results to linear queries. We derive the utility guarantee of CSM theoretically. An extensive experimental evaluation on real-world datasets over multiple fields demonstrates that our proposed mechanism consistently outperforms several state-of-the-art mechanisms under differential privacy.
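A hedged sketch of the core idea (not the paper's actual CSM, whose sparse-recovery step is omitted): project an n-bin histogram onto m << n random directions, then add Laplace noise calibrated to the L1 sensitivity of the short measurement vector, so only m noisy numbers are released instead of the exponentially large domain. The function and variable names are illustrative:

```python
import numpy as np

def compressed_laplace_release(hist, m, eps, rng):
    # Compress an n-bin histogram with a random +-1 sensing matrix, then
    # privatize the m-dimensional measurement vector with Laplace noise.
    n = hist.size
    A = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)  # sensing matrix
    # One individual changes one histogram count by 1, so the L1 sensitivity
    # of y = A @ hist is the largest column L1 norm of A (= sqrt(m) here).
    sensitivity = np.abs(A).sum(axis=0).max()
    y = A @ hist + rng.laplace(scale=sensitivity / eps, size=m)
    return A, y

rng = np.random.default_rng(0)
hist = rng.integers(0, 100, size=256).astype(float)
A, y = compressed_laplace_release(hist, m=32, eps=1.0, rng=rng)
```

The actual mechanism answers linear queries after reconstructing the signal from (A, y) with compressed-sensing recovery (e.g., basis pursuit), which is beyond this sketch.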
Privacy Amplification via Shuffling: Unified, Simplified, and Tightened
In decentralized settings, the shuffle model of differential privacy has emerged as a promising alternative to the classical local model. Analyzing privacy amplification via shuffling is a critical component in both single-message and multi-message shuffle protocols. However, current methods used in these two areas are distinct and specific, making them less convenient for protocol designers and practitioners. In this work, we introduce variation-ratio reduction as a unified framework for privacy amplification analyses in the shuffle model. This framework utilizes total variation bounds of local messages and probability ratio bounds of other users' blanket messages, converting them to indistinguishable levels. Our results indicate that the framework yields tighter bounds for both single-message and multi-message encoders (e.g., with local DP, local metric DP, or general multi-message randomizers). Specifically, for a broad range of local randomizers having extremal probability design, our amplification bounds are precisely tight. We also demonstrate that variation-ratio reduction is well-suited for parallel composition in the shuffle model and results in stricter privacy accounting for common sampling-based local randomizers. Our experimental findings show that, compared to existing amplification bounds, our numerical amplification bounds can save up to [...] of the budget for single-message protocols, [...] of the budget for multi-message protocols, and [...] of the budget for parallel composition. Additionally, our implementation for numerical amplification bounds has only [...] complexity and is highly efficient in practice, taking just minutes for [...] users. The code for our implementation can be found at \url{https://github.com/wangsw/PrivacyAmplification}.
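A minimal simulation of the single-message shuffle pipeline that these amplification analyses study (not the paper's variation-ratio framework itself): each user applies k-ary randomized response with local budget eps0, and a trusted shuffler outputs the messages in uniformly random order, detaching them from user identities. Names and parameter values are illustrative:

```python
import numpy as np

def krr_report(value, k, eps0, rng):
    # k-ary randomized response: with probability (e^eps0 - 1)/(e^eps0 + k - 1)
    # send the true value, otherwise send a uniform symbol over all k values.
    # This satisfies eps0-local differential privacy.
    p = (np.exp(eps0) - 1.0) / (np.exp(eps0) + k - 1.0)
    return value if rng.random() < p else int(rng.integers(0, k))

def shuffle_round(values, k, eps0, rng):
    # Each user randomizes locally; the shuffler forwards a uniformly
    # permuted multiset, which is what the analyzer observes.
    messages = np.array([krr_report(int(v), k, eps0, rng) for v in values])
    rng.shuffle(messages)
    return messages

rng = np.random.default_rng(0)
out = shuffle_round(rng.integers(0, 5, size=1000), k=5, eps0=1.0, rng=rng)
```

Amplification results quantify how much smaller the central privacy budget of the shuffled output is than eps0; the blanket of other users' random messages is exactly what the variation-ratio analysis bounds.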
Shallow Representations, Profound Discoveries: A methodological study of game culture in social media
This thesis explores the potential of representation learning techniques in game studies, highlighting their effectiveness and addressing challenges in data analysis. The primary focus of this thesis is shallow representation learning, which utilizes simpler model architectures but is able to yield effective modeling results. This thesis investigates the following research objectives: disentangling the dependencies of data, modeling temporal dynamics, learning multiple representations, and learning from heterogeneous data. The contributions of this thesis address these objectives from two perspectives: empirical analysis and methodology development. Chapters 1 and 2 provide a thorough introduction, motivation, and necessary background information for the thesis, framing the research and setting the stage for subsequent publications. Chapters 3 to 5 summarize the contributions of the six publications, each of which demonstrates the effectiveness of representation learning techniques in addressing various analytical challenges.
In Chapters 1 and 2, the research objects and questions are also motivated and described. In particular, an introduction to the primary application field, game studies, is provided, and the connection between data analysis and game culture is highlighted. Basic notions of representation learning and canonical techniques such as probabilistic principal component analysis, topic modeling, and embedding models are described. Analytical challenges and data types are also described to motivate the research of this thesis.
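As a concrete reminder of one canonical technique mentioned above, the following numpy-only sketch computes the closed-form maximum-likelihood solution of probabilistic PCA (Tipping & Bishop): the noise variance is the average discarded eigenvalue of the sample covariance, and the loading matrix spans the top-q principal subspace. Variable names are illustrative:

```python
import numpy as np

def ppca_ml(X, q):
    # Maximum-likelihood PPCA: X is (n_samples, d), q is the latent dimension.
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)                 # sample covariance, (d, d)
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]              # sort eigenpairs descending
    evals, evecs = evals[order], evecs[:, order]
    d = X.shape[1]
    sigma2 = evals[q:].mean() if q < d else 0.0  # ML estimate of noise variance
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    return W, sigma2

rng = np.random.default_rng(0)
Z = rng.standard_normal((2000, 2))               # latent factors
W_true = 2.0 * rng.standard_normal((5, 2))       # true loadings
X = Z @ W_true.T + 0.1 * rng.standard_normal((2000, 5))
W, sigma2 = ppca_ml(X, q=2)                      # sigma2 near 0.1**2 = 0.01
```

This is the generic textbook estimator, not any model developed in the thesis's publications.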
Chapter 3 presents two empirical analyses, conducted in Publications I and II, on player typologies and the temporal dynamics of player perceptions. The first empirical analysis takes advantage of a factor model to offer a flexible player typology analysis; the results and analytical framework are particularly useful for personalized gamification. The second empirical analysis uses topic modeling to analyze the temporal dynamics of player perceptions of the game No Man's Sky in relation to game changes. The results reflect a variety of player perceptions, including general gaming activities and game mechanics. Moreover, a set of underlying topics directly related to game updates and changes is extracted, and their temporal dynamics show that players respond differently to different updates and changes.
Chapter 4 presents two method developments related to factor models. The first method, DNBGFA, developed in Publication III, is a matrix factorization model for modeling the temporal dynamics of non-negative matrices from multiple sources. The second method, CFTM, developed in Publication IV, introduces a factor model into a topic model to handle sophisticated document-level covariates. The developed methods in Chapter 4 are also demonstrated for analyzing text data.
Chapter 5 summarizes Publications V and VI, which develop embedding models. Publication V introduces Bayesian non-parametrics to a graph embedding model to learn multiple representations for nodes. Publication VI utilizes a Gaussian copula model to deal with heterogeneous data in representation learning. The developed methods in Chapter 5 are also demonstrated for data analysis tasks in the context of online communities.
Lastly, Chapter 6 presents discussion and conclusions. The contributions of this thesis are highlighted, and limitations, ongoing challenges, and potential future research directions are discussed.
A Statistical Approach to the Alignment of fMRI Data
Multi-subject functional Magnetic Resonance Imaging (fMRI) studies are critical. The anatomical and functional structure varies across subjects, so image alignment is necessary. We define a probabilistic model to describe functional alignment. By imposing a prior distribution, such as the matrix Fisher-von Mises distribution, on the orthogonal transformation parameter, anatomical information is embedded in the estimation of the parameters, i.e., the combination of spatially distant voxels is penalized. Real applications show an improvement in the classification and interpretability of the results compared to various functional alignment methods.
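Without the anatomical prior described above, estimating an orthogonal functional alignment reduces to the classical orthogonal Procrustes problem, sketched below as a point of reference (the paper's matrix Fisher-von Mises penalization is not reproduced; variable names are illustrative):

```python
import numpy as np

def procrustes_align(source, target):
    # Orthogonal Procrustes: the R minimizing ||source @ R - target||_F over
    # orthogonal matrices is R = U @ Vt, where U S Vt = svd(source.T @ target).
    U, _, Vt = np.linalg.svd(source.T @ target)
    return U @ Vt

rng = np.random.default_rng(0)
R_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal map
source = rng.standard_normal((100, 4))                 # 100 time points, 4 voxels
target = source @ R_true                               # perfectly aligned target
R_hat = procrustes_align(source, target)               # recovers R_true
```

Anatomically informed priors modify this least-squares solution so that the estimated transformation avoids mixing spatially distant voxels.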
A comparison of the CAR and DAGAR spatial random effects models with an application to diabetics rate estimation in Belgium
When hierarchically modelling an epidemiological phenomenon on a finite collection of sites in space, one must always take a latent spatial effect into account in order to capture the correlation structure that links the phenomenon to the territory. In this work, we compare two autoregressive spatial models that can be used for this purpose: the classical CAR model and the more recent DAGAR model. Unlike the former, the latter has a desirable property: its ρ parameter can be naturally interpreted as the average neighbor pair correlation and, in addition, this parameter can be directly estimated when the effect is modelled using a DAGAR rather than a CAR structure. As an application, we model the diabetics rate in Belgium in 2014 and show the adequacy of these models in predicting the response variable when no covariates are available.
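For concreteness, a minimal numpy construction of the proper CAR prior's precision matrix is sketched below (DAGAR's DAG-based precision is not shown, and, as the abstract notes, the interpretation of ρ differs between the two models); the adjacency matrix and parameter values are illustrative:

```python
import numpy as np

def car_precision(W, rho, tau2=1.0):
    # Proper CAR prior for spatial effects phi ~ N(0, Q^{-1}):
    # Q = (D - rho * W) / tau2, with W a symmetric 0/1 adjacency matrix and
    # D = diag(neighbor counts). Q is positive definite for |rho| < 1.
    D = np.diag(W.sum(axis=1))
    return (D - rho * W) / tau2

# A 5-site chain graph: site i is a neighbor of sites i-1 and i+1.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
Q = car_precision(W, rho=0.9)  # symmetric, positive definite precision
```

In a hierarchical model, this precision matrix defines the latent spatial random effect whose posterior (together with ρ and τ²) is then estimated from the observed rates.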