
    Semi-supervised model-based clustering with controlled clusters leakage

    In this paper, we focus on finding clusters in partially categorized data sets. We propose a semi-supervised version of the Gaussian mixture model, called C3L, which retrieves natural subgroups of given categories. In contrast to other semi-supervised models, C3L is parametrized by a user-defined leakage level, which controls the maximal inconsistency between the initial categorization and the resulting clustering. Our method can be implemented as a module in practical expert systems to detect clusters that combine expert knowledge with the true distribution of the data. Moreover, it can be used to improve the results of less flexible clustering techniques, such as projection pursuit clustering. The paper presents an extensive theoretical analysis of the model and a fast algorithm for its efficient optimization. Experimental results show that C3L finds a high-quality clustering model, which can be applied to discovering meaningful groups in partially classified data.
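    As a rough illustration of the idea (not the authors' C3L algorithm: the cluster-per-category grouping, the responsibility-capping rule, and the EM updates below are simplifying assumptions), a semi-supervised Gaussian mixture can limit how much posterior mass a labelled point "leaks" to clusters outside its own category:

```python
# Minimal sketch of a semi-supervised Gaussian mixture with a leakage cap.
# NOT the paper's C3L estimator; the constraint handling here is illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_gmm(X, labels, clusters_per_cat=2, leakage=0.1, n_iter=50, seed=0):
    """labels[i] is the category of point i, or -1 if unlabelled (hypothetical API)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_cat = labels.max() + 1
    K = n_cat * clusters_per_cat                       # clusters grouped by category
    cat_of_cluster = np.repeat(np.arange(n_cat), clusters_per_cat)
    n, d = X.shape
    means = X[rng.choice(n, K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: standard Gaussian-mixture responsibilities.
        dens = np.column_stack([weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
                                for k in range(K)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Leakage constraint: a labelled point keeps at most `leakage` of its
        # posterior mass on clusters outside its own category.
        for i in np.where(labels >= 0)[0]:
            own = cat_of_cluster == labels[i]
            out_mass = resp[i, ~own].sum()
            if out_mass > leakage:
                resp[i, ~own] *= leakage / out_mass
                resp[i, own] *= (1.0 - leakage) / resp[i, own].sum()
        # M-step: weighted Gaussian parameter updates.
        Nk = resp.sum(axis=0)
        weights = Nk / n
        for k in range(K):
            means[k] = resp[:, k] @ X / Nk[k]
            diff = X - means[k]
            covs[k] = (resp[:, k][:, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return resp.argmax(axis=1), means
```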

    Context-Aware Generative Adversarial Privacy

    Preserving the utility of published datasets while simultaneously providing provable privacy guarantees is a well-known challenge. On the one hand, context-free privacy solutions, such as differential privacy, provide strong privacy guarantees, but often lead to a significant reduction in utility. On the other hand, context-aware privacy solutions, such as information theoretic privacy, achieve an improved privacy-utility tradeoff, but assume that the data holder has access to dataset statistics. We circumvent these limitations by introducing a novel context-aware privacy framework called generative adversarial privacy (GAP). GAP leverages recent advancements in generative adversarial networks (GANs) to allow the data holder to learn privatization schemes from the dataset itself. Under GAP, learning the privacy mechanism is formulated as a constrained minimax game between two players: a privatizer that sanitizes the dataset in a way that limits the risk of inference attacks on the individuals' private variables, and an adversary that tries to infer the private variables from the sanitized dataset. To evaluate GAP's performance, we investigate two simple (yet canonical) statistical dataset models: (a) the binary data model, and (b) the binary Gaussian mixture model. For both models, we derive game-theoretically optimal minimax privacy mechanisms, and show that the privacy mechanisms learned from data (in a generative adversarial fashion) match the theoretically optimal ones. This demonstrates that our framework can be easily applied in practice, even in the absence of dataset statistics.
    Comment: Improved version of a paper accepted by Entropy Journal, Special Issue on Information Theory in Machine Learning and Data Science
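    The minimax game can be pictured with a small alternating training loop. The sketch below is a generic privatizer-vs-adversary setup in PyTorch with an illustrative distortion penalty and synthetic data; it is not the paper's exact GAP formulation or its binary / binary-Gaussian-mixture models:

```python
# Generic adversarial-privacy training loop (illustrative; network sizes, loss
# weights, and the synthetic data are assumptions, not the GAP paper's setup).
import torch
import torch.nn as nn

privatizer = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 10))
adversary  = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

opt_p = torch.optim.Adam(privatizer.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
distortion_weight = 1.0   # Lagrangian-style weight standing in for the distortion constraint

for step in range(1000):
    x = torch.randn(128, 10)                 # public data (synthetic stand-in)
    s = (x[:, :1] > 0).float()               # private variable correlated with x

    # Adversary step: improve inference of s from the sanitized data.
    x_hat = privatizer(x).detach()
    loss_a = bce(adversary(x_hat), s)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # Privatizer step: confuse the adversary while limiting distortion of x.
    x_hat = privatizer(x)
    loss_p = -bce(adversary(x_hat), s) + distortion_weight * ((x_hat - x) ** 2).mean()
    opt_p.zero_grad(); loss_p.backward(); opt_p.step()
```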

    Privacy-Preserving Adversarial Networks

    We propose a data-driven framework for optimizing privacy-preserving data release mechanisms to attain the information-theoretically optimal tradeoff between minimizing distortion of useful data and concealing specific sensitive information. Our approach employs adversarially-trained neural networks to implement randomized mechanisms and to perform a variational approximation of mutual information privacy. We validate our Privacy-Preserving Adversarial Networks (PPAN) framework via proof-of-concept experiments on discrete and continuous synthetic data, as well as the MNIST handwritten digits dataset. For synthetic data, our model-agnostic PPAN approach achieves tradeoff points very close to the optimal tradeoffs that are analytically-derived from model knowledge. In experiments with the MNIST data, we visually demonstrate a learned tradeoff between minimizing the pixel-level distortion versus concealing the written digit.
    Comment: 16 pages
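    The variational approximation mentioned above is commonly based on the bound I(S; Z) >= H(S) - E[-log q(S|Z)], where q is the adversary's posterior estimate. The snippet below is my reading of that estimator for a binary sensitive variable, not the paper's exact formulation; the helper name is hypothetical:

```python
# Variational lower bound on mutual-information privacy (illustrative sketch):
#     I(S; Z) >= H(S) - E[-log q(S | Z)]
# where q(s|z) is the adversary network predicting the sensitive bit S from Z.
import torch
import torch.nn.functional as F

def variational_mi_lower_bound(adversary_logits, s, entropy_s):
    """`adversary_logits` are q(s|z)'s raw outputs; `entropy_s` is H(S) in nats."""
    cross_entropy = F.binary_cross_entropy_with_logits(adversary_logits, s.float())
    return entropy_s - cross_entropy
```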

    Innovations orthogonalization: a solution to the major pitfalls of EEG/MEG "leakage correction"

    The problem of interest here is the study of brain functional and effective connectivity based on non-invasive EEG-MEG inverse solution time series. These signals generally have low spatial resolution, such that an estimated signal at any one site is an instantaneous linear mixture of the true, actual, unobserved signals across all cortical sites. False connectivity can result from analysis of these low-resolution signals. Recent efforts toward "unmixing" have been developed, under the name of "leakage correction". One recent noteworthy approach is that by Colclough et al (2015 NeuroImage, 117:439-448), which forces the inverse solution signals to have zero cross-correlation at lag zero. The first goal here is to show that Colclough's method produces false human connectomes under very broad conditions. The second major goal is to develop a new solution that appropriately "unmixes" the inverse solution signals, based on innovations orthogonalization. The new method first fits a multivariate autoregression to the inverse solution signals, giving the mixed innovations. Second, the mixed innovations are orthogonalized. Third, the mixed and orthogonalized innovations allow the estimation of the "unmixing" matrix, which is then finally used to "unmix" the inverse solution signals. It is shown that under very broad conditions, the new method produces proper human connectomes, even when the signals are not generated by an autoregressive model.
    Comment: preprint, technical report, under license "Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)", https://creativecommons.org/licenses/by-nc-nd/4.0
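    The three steps can be condensed into a short numpy sketch. A first-order autoregression and a symmetric (ZCA-style) orthogonalization are simplifying assumptions here; the paper's actual estimators may differ:

```python
# Condensed sketch of innovations orthogonalization (illustrative, not the
# paper's exact procedure): VAR(1) fit, whitening of the innovations, and
# application of the resulting unmixing matrix to the original signals.
import numpy as np

def innovations_orthogonalization(Y):
    """Y: (channels, time) inverse-solution time series; returns unmixed signals."""
    # 1. Fit a multivariate autoregression (order 1 here) and take its residuals
    #    as the mixed innovations.
    Y0, Y1 = Y[:, :-1], Y[:, 1:]
    A = Y1 @ Y0.T @ np.linalg.pinv(Y0 @ Y0.T)      # least-squares VAR(1) coefficients
    E = Y1 - A @ Y0                                 # mixed innovations

    # 2. Orthogonalize the innovations via symmetric whitening of their covariance.
    C = np.cov(E)
    vals, vecs = np.linalg.eigh(C)
    vals = np.clip(vals, 1e-12, None)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T       # whitening / "unmixing" matrix

    # 3. Apply the estimated unmixing matrix to the inverse-solution signals.
    return W @ Y
```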