
    Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies

    The protection of privacy of individual-level information in genome-wide association study (GWAS) databases has been a major concern of researchers following the publication of "an attack" on GWAS data by Homer et al. (2008). Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to GWAS data, especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, provides a rigorous definition of privacy with meaningful guarantees in the presence of arbitrary external information, although the guarantees may come at a serious price in terms of data utility. Building on such notions, Uhler et al. (2013) proposed new methods to release aggregate GWAS data without compromising an individual's privacy. We extend the methods developed in Uhler et al. (2013) for releasing differentially-private $\chi^2$-statistics by allowing for an arbitrary number of cases and controls, and for releasing differentially-private allelic test statistics. We also provide a new interpretation by assuming the controls' data are known, which is a realistic assumption because some GWAS use publicly available data as controls. We assess the performance of the proposed methods through a risk-utility analysis on a real data set consisting of DNA samples collected by the Wellcome Trust Case Control Consortium and compare the methods with the differentially-private release mechanism proposed by Johnson and Shmatikov (2013). Comment: 28 pages, 2 figures, source code available upon request.
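
    The general mechanism class involved can be pictured with a minimal Laplace-mechanism sketch: compute a $\chi^2$ statistic from a genotype contingency table and perturb it with Laplace noise scaled by a sensitivity bound. The `sensitivity` value below is a placeholder; deriving valid sensitivity bounds for arbitrary case/control counts and for the allelic test is the paper's contribution and is not reproduced here.

```python
import numpy as np
from scipy.stats import chi2_contingency

def private_chi2(table, epsilon, sensitivity):
    """Release a chi^2 statistic for a case/control genotype table under
    epsilon-differential privacy via the Laplace mechanism.

    `sensitivity` must upper-bound the change in the statistic when one
    individual's record is modified; that bound is assumed given here.
    """
    chi2_stat, _, _, _ = chi2_contingency(table, correction=False)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return chi2_stat + noise

# toy 2x3 table: rows = case/control, columns = genotype counts
table = np.array([[120, 230, 150],
                  [100, 260, 140]])
print(private_chi2(table, epsilon=1.0, sensitivity=4.0))
```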

    Planning as Optimization: Dynamically Discovering Optimal Configurations for Runtime Situations

    The large number of possible configurations of modern software-based systems, combined with the large number of possible environmental situations of such systems, prohibits enumerating all adaptation options at design time and necessitates planning at run time to dynamically identify an appropriate configuration for a situation. While numerous planning techniques exist, they typically assume a detailed state-based model of the system and that the situations that warrant adaptations are known. Both of these assumptions can be violated in complex, real-world systems. As a result, adaptation planning must rely on simple models that capture what can be changed (input parameters) and what can be observed in the system and environment (output and context parameters). We therefore propose planning as optimization: the use of optimization strategies to discover optimal system configurations at run time for each distinct situation, where the situations themselves are also identified dynamically at run time. We apply our approach to CrowdNav, an open-source traffic routing system with the characteristics of a real-world system. We identify situations via clustering and conduct an empirical study that compares Bayesian optimization and two types of evolutionary optimization (NSGA-II and novelty search) in CrowdNav.
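
    The overall loop can be illustrated with a toy sketch under stated assumptions: observed context parameters are clustered into situations, and a per-situation search over input parameters picks a good configuration. The `evaluate` function, parameter dimensions, and use of plain random search (as a stand-in for the Bayesian and evolutionary optimizers compared in the paper) are illustrative assumptions, not CrowdNav's actual interface.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical system interface: cost observed when running `config`
# in the situation summarized by `situation_center`.
def evaluate(config, situation_center):
    return float(np.sum((config - situation_center[:config.size]) ** 2))

# 1) Identify situations by clustering observed context/output parameters.
context_samples = np.random.rand(500, 4)          # e.g. traffic observations
situations = KMeans(n_clusters=3, n_init=10).fit(context_samples)

# 2) For each situation, search the input-parameter space for a good
#    configuration (random search as a stand-in optimizer).
best = {}
for k, center in enumerate(situations.cluster_centers_):
    candidates = np.random.rand(200, 2)            # 2 tunable parameters
    costs = [evaluate(c, center) for c in candidates]
    best[k] = candidates[int(np.argmin(costs))]
print(best)
```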

    Private Incremental Regression

    Data is continuously generated by modern data sources, and a recent challenge in machine learning has been to develop techniques that perform well in an incremental (streaming) setting. In this paper, we investigate the problem of private machine learning, where, as is common in practice, the data is not given at once but rather arrives incrementally over time. We introduce the problems of private incremental ERM and private incremental regression, where the general goal is to always maintain a good empirical risk minimizer for the history observed under differential privacy. Our first contribution is a generic transformation of private batch ERM mechanisms into private incremental ERM mechanisms, based on the simple idea of invoking the private batch ERM procedure at regular time intervals. We take this construction as a baseline for comparison. We then provide two mechanisms for the private incremental regression problem. Our first mechanism is based on privately constructing a noisy incremental gradient function, which is then used in a modified projected gradient procedure at every timestep. This mechanism has an excess empirical risk of $\approx \sqrt{d}$, where $d$ is the dimensionality of the data. While, by the results of [Bassily et al. 2014], this bound is tight in the worst case, we show that certain geometric properties of the input and constraint set can be used to derive significantly better results for certain interesting regression problems. Comment: To appear in PODS 201
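
    A toy rendering of the noisy-gradient idea, under assumed details: at each timestep the gradient of the squared loss over the history is perturbed with Gaussian noise and a projected gradient step is taken. The calibration of the noise to the privacy budget and the actual constraint set are the paper's content and are not reproduced; `noise_scale`, `eta`, and the l2-ball projection are illustrative choices.

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Project onto an l2 ball, a simple example of a constraint set."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def private_incremental_regression(stream, dim, eta=0.1, noise_scale=0.5):
    """At every timestep, take a projected gradient step on the squared loss
    over the history, with Gaussian noise added to the gradient. The proper
    privacy calibration of `noise_scale` is not reproduced here."""
    w = np.zeros(dim)
    X_hist, y_hist = [], []
    for x_t, y_t in stream:
        X_hist.append(x_t)
        y_hist.append(y_t)
        X, y = np.array(X_hist), np.array(y_hist)
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        grad += np.random.normal(scale=noise_scale, size=dim)
        w = project_l2_ball(w - eta * grad)
        yield w

# usage: a synthetic stream of (feature, label) pairs
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.2])
stream = ((x, float(x @ w_true)) for x in rng.normal(size=(100, 3)))
for w in private_incremental_regression(stream, dim=3):
    pass
print(w)
```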

    Robust 1-Bit Compressed Sensing via Hinge Loss Minimization

    This work theoretically studies the problem of estimating a structured high-dimensional signal $x_0 \in \mathbb{R}^n$ from noisy 1-bit Gaussian measurements. Our recovery approach is based on a simple convex program which uses the hinge loss function as data fidelity term. While such a risk minimization strategy is very natural for learning binary output models, such as in classification, its capacity to estimate a specific signal vector is largely unexplored. A major difficulty is that the hinge loss is just piecewise linear, so that its "curvature energy" is concentrated in a single point. This is substantially different from other popular loss functions considered in signal estimation, e.g., the square or logistic loss, which are at least locally strongly convex. It is therefore somewhat unexpected that we can still prove very similar types of recovery guarantees for the hinge loss estimator, even in the presence of strong noise. More specifically, our non-asymptotic error bounds show that stable and robust reconstruction of $x_0$ can be achieved with the optimal oversampling rate $O(m^{-1/2})$ in terms of the number of measurements $m$. Moreover, we permit a wide class of structural assumptions on the ground truth signal, in the sense that $x_0$ can belong to an arbitrary bounded convex set $K \subset \mathbb{R}^n$. The proofs of our main results rely on some recent advances in statistical learning theory due to Mendelson. In particular, we invoke an adapted version of Mendelson's small ball method that allows us to establish a quadratic lower bound on the error of the first-order Taylor approximation of the empirical hinge loss function.
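
    A minimal sketch of this kind of constrained hinge-loss program, solved by projected subgradient descent with the constraint set $K$ taken (for illustration only) to be an l2 ball; the paper's analysis covers arbitrary bounded convex $K$, and this is not the authors' algorithmic recipe.

```python
import numpy as np

def hinge_loss_estimator(A, y, radius=1.0, steps=2000, eta=0.05):
    """Projected subgradient descent for
        min_{x in K}  (1/m) * sum_i max(0, 1 - y_i * <a_i, x>),
    with K an l2 ball of the given radius for illustration."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(steps):
        margins = y * (A @ x)
        active = margins < 1.0                     # points inside the hinge
        subgrad = -(A[active].T @ y[active]) / m
        x = x - eta * subgrad
        norm = np.linalg.norm(x)
        if norm > radius:                          # project back onto K
            x *= radius / norm
    return x

# usage: 1-bit measurements y = sign(<a_i, x0> + noise)
rng = np.random.default_rng(1)
n, m = 50, 500
x0 = np.zeros(n)
x0[:5] = 1.0
x0 /= np.linalg.norm(x0)
A = rng.normal(size=(m, n))
y = np.sign(A @ x0 + 0.1 * rng.normal(size=m))
x_hat = hinge_loss_estimator(A, y)
print(np.dot(x_hat / np.linalg.norm(x_hat), x0))   # directional alignment
```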

    Improved Practical Matrix Sketching with Guarantees

    Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these approaches hard to compare in theory and in practice. This paper attempts to categorize and compare most known methods under row-wise streaming updates with provable guarantees, and then to tweak some of these methods to gain practical improvements while retaining guarantees. For instance, we observe that a simple heuristic, iSVD, with no guarantees, tends to outperform all known approaches in terms of the size/error trade-off. We modify FrequentDirections, the best-performing method with guarantees under the size/error trade-off, to match the performance of iSVD while retaining its guarantees. We also demonstrate some adversarial datasets where iSVD performs quite poorly. Comparing techniques under the time/error trade-off, those based on hashing or sampling tend to perform better. In this setting we modify the most studied sampling regime to retain its error guarantee while obtaining dramatic improvements in the time/error trade-off. Finally, we provide easy replication of our studies on APT, a new testbed which makes available not only code and datasets, but also a computing platform with fixed environmental settings. Comment: 27 pages.
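
    For reference, a plain implementation of the baseline FrequentDirections sketch (without the paper's modifications aimed at matching iSVD's empirical performance): rows are streamed into a buffer, and whenever the buffer fills, an SVD-based shrinkage step zeroes out the smallest directions.

```python
import numpy as np

def frequent_directions(A, ell):
    """Plain FrequentDirections sketch of the rows of A.
    Returns B with `ell` rows such that A^T A is approximated by B^T B."""
    d = A.shape[1]
    B = np.zeros((2 * ell, d))
    next_free = 0

    def shrink(B):
        # subtract the ell-th squared singular value from all squared
        # singular values, zeroing the trailing directions
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        delta = s[ell - 1] ** 2 if len(s) >= ell else 0.0
        s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
        out = np.zeros_like(B)
        out[:len(s)] = s[:, None] * Vt
        return out

    for row in A:
        if next_free == 2 * ell:
            B = shrink(B)
            next_free = ell            # the last ell rows are now zero
        B[next_free] = row
        next_free += 1
    return shrink(B)[:ell]

A = np.random.randn(1000, 30)
B = frequent_directions(A, ell=10)
print(np.linalg.norm(A.T @ A - B.T @ B, 2) / np.linalg.norm(A, 'fro') ** 2)
```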

    Driving Markov chain Monte Carlo with a dependent random stream

    Markov chain Monte Carlo is a widely used technique for generating a dependent sequence of samples from complex distributions. Conventionally, these methods require a source of independent random variates. Most implementations use pseudo-random numbers instead because generating truly independent variates with a physical system is not straightforward. In this paper we show how to modify some commonly used Markov chains to use a dependent stream of random numbers in place of independent uniform variates. The resulting Markov chains have the correct invariant distribution without requiring detailed knowledge of the stream's dependencies or even its marginal distribution. As a side effect, far fewer random numbers are sometimes required to obtain accurate results. Comment: 16 pages, 4 figures.
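
    For context, a bare random-walk Metropolis sampler written so that the accept/reject step consumes an explicit stream of uniforms, which is exactly the input the paper replaces with a dependent stream; the modification that preserves the invariant distribution under dependence is the paper's contribution and is not shown here.

```python
import numpy as np

def metropolis_hastings(log_target, x0, uniform_stream, proposal_scale=1.0):
    """Random-walk Metropolis sampler whose only accept/reject randomness
    comes from `uniform_stream`. With i.i.d. uniforms this is standard MH."""
    x = x0
    rng = np.random.default_rng(0)   # proposal noise only
    samples = []
    for u in uniform_stream:
        proposal = x + proposal_scale * rng.normal()
        log_accept = log_target(proposal) - log_target(x)
        if np.log(u) < log_accept:
            x = proposal
        samples.append(x)
    return np.array(samples)

# usage with an i.i.d. uniform stream and a standard-normal target
stream = iter(np.random.default_rng(1).uniform(size=5000))
draws = metropolis_hastings(lambda z: -0.5 * z * z, x0=0.0,
                            uniform_stream=stream)
print(draws.mean(), draws.var())
```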

    Sparsity-Cognizant Total Least-Squares for Perturbed Compressive Sampling

    Solving linear regression problems based on the total least-squares (TLS) criterion has well-documented merits in various applications where perturbations appear both in the data vector and in the regression matrix. However, existing TLS approaches do not account for sparsity possibly present in the unknown vector of regression coefficients. On the other hand, sparsity is the key attribute exploited by modern compressive sampling and variable selection approaches to linear regression, which include noise in the data but do not account for perturbations in the regression matrix. The present paper fills this gap by formulating and solving TLS optimization problems under sparsity constraints. Near-optimum and reduced-complexity suboptimum sparse (S-) TLS algorithms are developed to address the perturbed compressive sampling (and the related dictionary learning) challenge that arises when there is a mismatch between the true and adopted bases over which the unknown vector is sparse. The novel S-TLS schemes also allow for perturbations in the regression matrix of the least-absolute shrinkage and selection operator (Lasso), and endow TLS approaches with the ability to cope with sparse, under-determined "errors-in-variables" models. Interesting generalizations can further exploit prior knowledge on the perturbations to obtain novel weighted and structured S-TLS solvers. Analysis and simulations demonstrate the practical impact of S-TLS in calibrating the mismatch effects of contemporary grid-based approaches to cognitive radio sensing and in robust direction-of-arrival estimation using antenna arrays. Comment: 30 pages, 10 figures; submitted to IEEE Transactions on Signal Processing.
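
    One natural block-coordinate sketch of a sparsity-regularized TLS objective of the form $\min_{x, E} \|E\|_F^2 + \|y - (A+E)x\|_2^2 + \lambda \|x\|_1$: alternate a Lasso step in $x$ with a closed-form update of the matrix perturbation $E$. This only illustrates the formulation; it is not the paper's near-optimum or reduced-complexity S-TLS algorithm, and the weighted/structured variants are not covered.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_tls(A, y, lam=0.1, iters=20):
    """Alternate a Lasso step in x (perturbed design held fixed) with a
    closed-form update of the perturbation E for the objective
        ||E||_F^2 + ||y - (A + E) x||_2^2 + lam * ||x||_1 ."""
    m, n = A.shape
    E = np.zeros_like(A)
    x = np.zeros(n)
    for _ in range(iters):
        # x-step: Lasso with the currently estimated perturbed design A + E
        # (sklearn's alpha differs from lam by a 1/(2m) scaling).
        lasso = Lasso(alpha=lam / (2 * m), fit_intercept=False, max_iter=5000)
        x = lasso.fit(A + E, y).coef_
        # E-step: closed-form minimizer of ||E||_F^2 + ||(y - A x) - E x||^2
        r = y - A @ x
        E = np.outer(r, x) / (1.0 + x @ x)
    return x, E

# usage: sparse x with a perturbed regression matrix
rng = np.random.default_rng(2)
A = rng.normal(size=(60, 100))
x_true = np.zeros(100)
x_true[:4] = [1.5, -1.0, 0.8, 0.5]
y = (A + 0.05 * rng.normal(size=A.shape)) @ x_true + 0.01 * rng.normal(size=60)
x_hat, _ = sparse_tls(A, y)
print(np.flatnonzero(np.abs(x_hat) > 0.1))
```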

    Minimizing Negative Transfer of Knowledge in Multivariate Gaussian Processes: A Scalable and Regularized Approach

    Recently there has been an increasing interest in the multivariate Gaussian process (MGP), which extends the Gaussian process (GP) to deal with multiple outputs. One approach to constructing the MGP and accounting for non-trivial commonalities amongst outputs employs a convolution process (CP). The CP is based on the idea of sharing latent functions across several convolutions. Despite the elegance of the CP construction, it poses new challenges that have yet to be tackled. First, even with a moderate number of outputs, model building quickly becomes prohibitive due to the huge increase in computational demands and in the number of parameters to be estimated. Second, negative transfer of knowledge may occur when some outputs do not share commonalities. In this paper we address these issues. We propose a regularized pairwise modeling approach for the MGP established using the CP. The key feature of our approach is to distribute the estimation of the full multivariate model across a group of bivariate GPs which are built individually. Interestingly, pairwise modeling turns out to possess unique characteristics that allow us to tackle the challenge of negative transfer by penalizing the latent function that facilitates information sharing in each bivariate model. Predictions are then made by combining the predictions from the bivariate models within a Bayesian framework. The proposed method scales well when the number of outputs is large and minimizes the negative transfer of knowledge between uncorrelated outputs. Statistical guarantees for the proposed method are studied and its advantageous features are demonstrated through numerical studies.
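
    A crude stand-in for the pairwise idea, under heavy assumptions: each pair (target output, other output) is modeled by a single GP over the inputs augmented with an output-index feature, and the per-pair predictions for the target are combined by inverse-variance weighting. The convolution-process construction, the regularization that suppresses negative transfer, and the Bayesian combination rule are the paper's contributions and are not reproduced.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def pairwise_mgp_predict(X, Y, X_new, target=0):
    """Fit a GP on the target output paired with each other output (outputs
    distinguished by an extra index feature), then combine the per-pair
    predictive means by inverse-variance weighting."""
    n_outputs = Y.shape[1]
    means, variances = [], []
    for j in range(n_outputs):
        if j == target:
            continue
        # stack the two outputs and tag each row with its output index
        X_pair = np.vstack([np.hstack([X, np.zeros((len(X), 1))]),
                            np.hstack([X, np.ones((len(X), 1))])])
        y_pair = np.concatenate([Y[:, target], Y[:, j]])
        gp = GaussianProcessRegressor(RBF() + WhiteKernel(), normalize_y=True)
        gp.fit(X_pair, y_pair)
        m, s = gp.predict(np.hstack([X_new, np.zeros((len(X_new), 1))]),
                          return_std=True)
        means.append(m)
        variances.append(s ** 2)
    w = 1.0 / np.array(variances)
    return (w * np.array(means)).sum(axis=0) / w.sum(axis=0)

# usage: three correlated toy outputs
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(40, 1))
f = np.sin(X[:, 0])
Y = np.column_stack([f, f + 0.1 * rng.normal(size=40), -f])
print(pairwise_mgp_predict(X, Y, X_new=np.linspace(-3, 3, 5)[:, None]))
```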