6,329 research outputs found
Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies
The protection of privacy of individual-level information in genome-wide
association study (GWAS) databases has been a major concern of researchers
following the publication of "an attack" on GWAS data by Homer et al. (2008)
Traditional statistical methods for confidentiality and privacy protection of
statistical databases do not scale well to deal with GWAS data, especially in
terms of guarantees regarding protection from linkage to external information.
The more recent concept of differential privacy, introduced by the
cryptographic community, is an approach that provides a rigorous definition of
privacy with meaningful privacy guarantees in the presence of arbitrary
external information, although the guarantees may come at a serious price in
terms of data utility. Building on such notions, Uhler et al. (2013) proposed
new methods to release aggregate GWAS data without compromising an individual's
privacy. We extend the methods developed in Uhler et al. (2013) for releasing
differentially-private -statistics by allowing for arbitrary number of
cases and controls, and for releasing differentially-private allelic test
statistics. We also provide a new interpretation by assuming the controls' data
are known, which is a realistic assumption because some GWAS use publicly
available data as controls. We assess the performance of the proposed methods
through a risk-utility analysis on a real data set consisting of DNA samples
collected by the Wellcome Trust Case Control Consortium and compare the methods
with the differentially-private release mechanism proposed by Johnson and
Shmatikov (2013).Comment: 28 pages, 2 figures, source code available upon reques
Planning as Optimization: Dynamically Discovering Optimal Configurations for Runtime Situations
The large number of possible configurations of modern software-based systems,
combined with the large number of possible environmental situations of such
systems, prohibits enumerating all adaptation options at design time and
necessitates planning at run time to dynamically identify an appropriate
configuration for a situation. While numerous planning techniques exist, they
typically assume a detailed state-based model of the system and that the
situations that warrant adaptations are known. Both of these assumptions can be
violated in complex, real-world systems. As a result, adaptation planning must
rely on simple models that capture what can be changed (input parameters) and
observed in the system and environment (output and context parameters). We
therefore propose planning as optimization: the use of optimization strategies
to discover optimal system configurations at runtime for each distinct
situation that is also dynamically identified at runtime. We apply our approach
to CrowdNav, an open-source traffic routing system with the characteristics of
a real-world system. We identify situations via clustering and conduct an
empirical study that compares Bayesian optimization and two types of
evolutionary optimization (NSGA-II and novelty search) in CrowdNav
Private Incremental Regression
Data is continuously generated by modern data sources, and a recent challenge
in machine learning has been to develop techniques that perform well in an
incremental (streaming) setting. In this paper, we investigate the problem of
private machine learning, where as common in practice, the data is not given at
once, but rather arrives incrementally over time.
We introduce the problems of private incremental ERM and private incremental
regression where the general goal is to always maintain a good empirical risk
minimizer for the history observed under differential privacy. Our first
contribution is a generic transformation of private batch ERM mechanisms into
private incremental ERM mechanisms, based on a simple idea of invoking the
private batch ERM procedure at some regular time intervals. We take this
construction as a baseline for comparison. We then provide two mechanisms for
the private incremental regression problem. Our first mechanism is based on
privately constructing a noisy incremental gradient function, which is then
used in a modified projected gradient procedure at every timestep. This
mechanism has an excess empirical risk of , where is the
dimensionality of the data. While from the results of [Bassily et al. 2014]
this bound is tight in the worst-case, we show that certain geometric
properties of the input and constraint set can be used to derive significantly
better results for certain interesting regression problems.Comment: To appear in PODS 201
Robust 1-Bit Compressed Sensing via Hinge Loss Minimization
This work theoretically studies the problem of estimating a structured
high-dimensional signal from noisy -bit Gaussian
measurements. Our recovery approach is based on a simple convex program which
uses the hinge loss function as data fidelity term. While such a risk
minimization strategy is very natural to learn binary output models, such as in
classification, its capacity to estimate a specific signal vector is largely
unexplored. A major difficulty is that the hinge loss is just piecewise linear,
so that its "curvature energy" is concentrated in a single point. This is
substantially different from other popular loss functions considered in signal
estimation, e.g., the square or logistic loss, which are at least locally
strongly convex. It is therefore somewhat unexpected that we can still prove
very similar types of recovery guarantees for the hinge loss estimator, even in
the presence of strong noise. More specifically, our non-asymptotic error
bounds show that stable and robust reconstruction of can be achieved with
the optimal oversampling rate in terms of the number of
measurements . Moreover, we permit a wide class of structural assumptions on
the ground truth signal, in the sense that can belong to an arbitrary
bounded convex set . The proofs of our main results
rely on some recent advances in statistical learning theory due to Mendelson.
In particular, we invoke an adapted version of Mendelson's small ball method
that allows us to establish a quadratic lower bound on the error of the first
order Taylor approximation of the empirical hinge loss function
Improved Practical Matrix Sketching with Guarantees
Matrices have become essential data representations for many large-scale
problems in data analytics, and hence matrix sketching is a critical task.
Although much research has focused on improving the error/size tradeoff under
various sketching paradigms, the many forms of error bounds make these
approaches hard to compare in theory and in practice. This paper attempts to
categorize and compare most known methods under row-wise streaming updates with
provable guarantees, and then to tweak some of these methods to gain practical
improvements while retaining guarantees.
For instance, we observe that a simple heuristic iSVD, with no guarantees,
tends to outperform all known approaches in terms of size/error trade-off. We
modify the best performing method with guarantees FrequentDirections under the
size/error trade-off to match the performance of iSVD and retain its
guarantees. We also demonstrate some adversarial datasets where iSVD performs
quite poorly. In comparing techniques in the time/error trade-off, techniques
based on hashing or sampling tend to perform better. In this setting we modify
the most studied sampling regime to retain error guarantee but obtain dramatic
improvements in the time/error trade-off.
Finally, we provide easy replication of our studies on APT, a new testbed
which makes available not only code and datasets, but also a computing platform
with fixed environmental settings.Comment: 27 page
Driving Markov chain Monte Carlo with a dependent random stream
Markov chain Monte Carlo is a widely-used technique for generating a
dependent sequence of samples from complex distributions. Conventionally, these
methods require a source of independent random variates. Most implementations
use pseudo-random numbers instead because generating true independent variates
with a physical system is not straightforward. In this paper we show how to
modify some commonly used Markov chains to use a dependent stream of random
numbers in place of independent uniform variates. The resulting Markov chains
have the correct invariant distribution without requiring detailed knowledge of
the stream's dependencies or even its marginal distribution. As a side-effect,
sometimes far fewer random numbers are required to obtain accurate results.Comment: 16 pages, 4 figure
Sparsity-Cognizant Total Least-Squares for Perturbed Compressive Sampling
Solving linear regression problems based on the total least-squares (TLS)
criterion has well-documented merits in various applications, where
perturbations appear both in the data vector as well as in the regression
matrix. However, existing TLS approaches do not account for sparsity possibly
present in the unknown vector of regression coefficients. On the other hand,
sparsity is the key attribute exploited by modern compressive sampling and
variable selection approaches to linear regression, which include noise in the
data, but do not account for perturbations in the regression matrix. The
present paper fills this gap by formulating and solving TLS optimization
problems under sparsity constraints. Near-optimum and reduced-complexity
suboptimum sparse (S-) TLS algorithms are developed to address the perturbed
compressive sampling (and the related dictionary learning) challenge, when
there is a mismatch between the true and adopted bases over which the unknown
vector is sparse. The novel S-TLS schemes also allow for perturbations in the
regression matrix of the least-absolute selection and shrinkage selection
operator (Lasso), and endow TLS approaches with ability to cope with sparse,
under-determined "errors-in-variables" models. Interesting generalizations can
further exploit prior knowledge on the perturbations to obtain novel weighted
and structured S-TLS solvers. Analysis and simulations demonstrate the
practical impact of S-TLS in calibrating the mismatch effects of contemporary
grid-based approaches to cognitive radio sensing, and robust
direction-of-arrival estimation using antenna arrays.Comment: 30 pages, 10 figures, submitted to IEEE Transactions on Signal
Processin
Minimizing Negative Transfer of Knowledge in Multivariate Gaussian Processes: A Scalable and Regularized Approach
Recently there has been an increasing interest in the multivariate Gaussian
process (MGP) which extends the Gaussian process (GP) to deal with multiple
outputs. One approach to construct the MGP and account for non-trivial
commonalities amongst outputs employs a convolution process (CP). The CP is
based on the idea of sharing latent functions across several convolutions.
Despite the elegance of the CP construction, it provides new challenges that
need yet to be tackled. First, even with a moderate number of outputs, model
building is extremely prohibitive due to the huge increase in computational
demands and number of parameters to be estimated. Second, the negative transfer
of knowledge may occur when some outputs do not share commonalities. In this
paper we address these issues. We propose a regularized pairwise modeling
approach for the MGP established using CP. The key feature of our approach is
to distribute the estimation of the full multivariate model into a group of
bivariate GPs which are individually built. Interestingly pairwise modeling
turns out to possess unique characteristics, which allows us to tackle the
challenge of negative transfer through penalizing the latent function that
facilitates information sharing in each bivariate model. Predictions are then
made through combining predictions from the bivariate models within a Bayesian
framework. The proposed method has excellent scalability when the number of
outputs is large and minimizes the negative transfer of knowledge between
uncorrelated outputs. Statistical guarantees for the proposed method are
studied and its advantageous features are demonstrated through numerical
studies
- …