Semi-supervised model-based clustering with controlled clusters leakage
In this paper, we focus on finding clusters in partially categorized data sets. We propose a semi-supervised version of the Gaussian mixture model, called C3L, which retrieves natural subgroups of given categories. In contrast to other semi-supervised models, C3L is parametrized by a user-defined leakage level, which controls the maximal inconsistency between the initial categorization and the resulting clustering. Our method can be implemented as a module in practical expert systems to detect clusters that combine expert knowledge with the true distribution of the data. Moreover, it can be used to improve the results of less flexible clustering techniques, such as projection pursuit clustering. The paper presents an extensive theoretical analysis of the model and a fast algorithm for its efficient optimization. Experimental results show that C3L finds a high-quality clustering model, which can be applied to discovering meaningful groups in partially classified data.
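A minimal sketch of the general idea (not the authors' C3L optimization, which is not reproduced here) is a Gaussian mixture seeded by the partial categorization, with the disagreement between initial categories and resulting clusters measured afterwards. The helper names seeded_gmm and leakage below are hypothetical, for illustration only.

    # Hedged sketch: a partial-label-seeded GMM, illustrating the general
    # idea only; C3L's leakage-constrained optimization is not reproduced.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def seeded_gmm(X, labels, n_components):
        """Fit a GMM whose components are initialized at the means of the
        labeled points; labels == -1 marks uncategorized samples."""
        means = np.array([X[labels == k].mean(axis=0)
                          for k in range(n_components)])
        gmm = GaussianMixture(n_components=n_components,
                              means_init=means, random_state=0)
        return gmm.fit(X)  # EM runs over all points, labeled and unlabeled

    def leakage(labels, assignments):
        """Fraction of labeled points whose cluster disagrees with their
        initial category -- a crude analogue of C3L's leakage level."""
        mask = labels != -1
        return float(np.mean(assignments[mask] != labels[mask]))

    # Toy usage: two Gaussian blobs, 10% of the points pre-categorized.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
    y = np.full(400, -1)
    y[:20], y[200:220] = 0, 1
    model = seeded_gmm(X, y, 2)
    print("leakage:", leakage(y, model.predict(X)))

In this simplification the leakage is only measured after the fit; C3L, by contrast, constrains it during optimization.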
Context-Aware Generative Adversarial Privacy
Preserving the utility of published datasets while simultaneously providing
provable privacy guarantees is a well-known challenge. On the one hand,
context-free privacy solutions, such as differential privacy, provide strong
privacy guarantees, but often lead to a significant reduction in utility. On
the other hand, context-aware privacy solutions, such as information theoretic
privacy, achieve an improved privacy-utility tradeoff, but assume that the data
holder has access to dataset statistics. We circumvent these limitations by
introducing a novel context-aware privacy framework called generative
adversarial privacy (GAP). GAP leverages recent advancements in generative
adversarial networks (GANs) to allow the data holder to learn privatization
schemes from the dataset itself. Under GAP, learning the privacy mechanism is
formulated as a constrained minimax game between two players: a privatizer that
sanitizes the dataset in a way that limits the risk of inference attacks on the
individuals' private variables, and an adversary that tries to infer the
private variables from the sanitized dataset. To evaluate GAP's performance, we
investigate two simple (yet canonical) statistical dataset models: (a) the
binary data model, and (b) the binary Gaussian mixture model. For both models,
we derive game-theoretically optimal minimax privacy mechanisms, and show that
the privacy mechanisms learned from data (in a generative adversarial fashion)
match the theoretically optimal ones. This demonstrates that our framework can
be easily applied in practice, even in the absence of dataset statistics.

Comment: Improved version of a paper accepted by Entropy Journal, Special Issue on Information Theory in Machine Learning and Data Science.
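As a concrete illustration of the constrained minimax game described above, the following PyTorch sketch alternates updates between a privatizer and an adversary on the binary Gaussian mixture model. The architectures, the penalty weight lam, the squared-error distortion, and the training schedule are assumptions for illustration, not the paper's exact formulation.

    # Hedged sketch of a GAP-style minimax game; details are illustrative.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    privatizer = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
    adversary = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
    opt_p = torch.optim.Adam(privatizer.parameters(), lr=1e-3)
    opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()
    lam = 5.0  # distortion penalty weight (assumption, not from the paper)

    # Toy binary Gaussian mixture: private bit S, data X ~ N(2S - 1, I).
    s = torch.randint(0, 2, (512, 1)).float()
    x = torch.randn(512, 2) + (2 * s - 1)

    for step in range(2000):
        x_hat = privatizer(x)  # sanitized release
        # Adversary step: learn to infer the private bit from x_hat.
        a_loss = bce(adversary(x_hat.detach()), s)
        opt_a.zero_grad()
        a_loss.backward()
        opt_a.step()
        # Privatizer step: defeat the adversary while keeping distortion small.
        p_loss = -bce(adversary(x_hat), s) + lam * (x_hat - x).pow(2).mean()
        opt_p.zero_grad()
        p_loss.backward()
        opt_p.step()

The soft penalty lam stands in for the paper's hard distortion constraint; a Lagrangian or projection step would be closer to the constrained game.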
Privacy-Preserving Adversarial Networks
We propose a data-driven framework for optimizing privacy-preserving data
release mechanisms to attain the information-theoretically optimal tradeoff
between minimizing distortion of useful data and concealing specific sensitive
information. Our approach employs adversarially trained neural networks to implement randomized mechanisms and to perform a variational approximation of mutual information privacy. We validate our Privacy-Preserving Adversarial
Networks (PPAN) framework via proof-of-concept experiments on discrete and
continuous synthetic data, as well as the MNIST handwritten digits dataset. For
synthetic data, our model-agnostic PPAN approach achieves tradeoff points very close to the optimal tradeoffs analytically derived from model knowledge. In experiments with the MNIST data, we visually demonstrate a learned tradeoff between minimizing pixel-level distortion and concealing the written digit.

Comment: 16 pages.
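The variational approximation mentioned above can be made concrete with a standard bound (a sketch of the general principle; the paper's exact objective may differ). For any variational posterior q, the adversary's cross-entropy upper-bounds the conditional entropy of the sensitive variable S given the release, yielding a lower bound on the mutual-information leakage:

    \[
    I(S;\hat{X}) = H(S) - H(S \mid \hat{X})
      \;\ge\; H(S) - \mathbb{E}\bigl[-\log q(S \mid \hat{X})\bigr],
    \qquad \text{subject to } \mathbb{E}\bigl[d(X,\hat{X})\bigr] \le D.
    \]

Equality holds when q matches the true posterior, so training an expressive adversary to minimize its cross-entropy tightens the estimate; the release mechanism then minimizes this variational leakage estimate under the distortion budget D.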
Innovations orthogonalization: a solution to the major pitfalls of EEG/MEG "leakage correction"
The problem of interest here is the study of brain functional and effective connectivity based on non-invasive EEG/MEG inverse-solution time series. These signals generally have low spatial resolution, so that the estimated signal at any one site is an instantaneous linear mixture of the true, unobserved signals across all cortical sites. False connectivity can result from the analysis of these low-resolution signals. Recent "unmixing" efforts have been developed under the name of "leakage correction". One recent noteworthy approach is that of Colclough et al (2015 NeuroImage, 117:439-448), which forces the inverse-solution signals to have zero cross-correlation at lag zero. The first goal here is to show that Colclough's method produces false human connectomes under very broad conditions. The second major goal is to develop a new solution that appropriately "unmixes" the inverse-solution signals, based on innovations orthogonalization. The new method first fits a multivariate autoregression to the inverse-solution signals, giving the mixed innovations. Second, the mixed innovations are orthogonalized. Third, the mixed and orthogonalized innovations together allow estimation of the "unmixing" matrix, which is finally used to "unmix" the inverse-solution signals. It is shown that, under very broad conditions, the new method produces proper human connectomes, even when the signals are not generated by an autoregressive model.

Comment: preprint, technical report, under license "Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)", https://creativecommons.org/licenses/by-nc-nd/4.0