Non-convex Global Minimization and False Discovery Rate Control for the TREX
The TREX is a recently introduced method for performing sparse
high-dimensional regression. Despite its statistical promise as an alternative
to the lasso, square-root lasso, and scaled lasso, the TREX is computationally
challenging in that it requires solving a non-convex optimization problem. This
paper shows a remarkable result: despite the non-convexity of the TREX problem,
there exists a polynomial-time algorithm that is guaranteed to find the global
minimum. This result adds the TREX to a very short list of non-convex
optimization problems that can be globally optimized (principal components
analysis being a famous example). After deriving and developing this new
approach, we demonstrate that (i) the ability of the preexisting TREX heuristic
to reach the global minimum is strongly dependent on the difficulty of the
underlying statistical problem, (ii) the new polynomial-time algorithm for TREX
permits a novel variable ranking and selection scheme, and (iii) this scheme can be
incorporated into a rule that controls the false discovery rate (FDR) of
included features in the model. To achieve this last aim, we provide an
extension of the results of Barber & Candes (2015) to establish that the
knockoff filter framework can be applied to the TREX. This investigation thus
provides both a rare case study of a heuristic for non-convex optimization and
a novel way of exploiting non-convexity for statistical inference.
Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.
Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors,
"Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provable optimal recovery using the algorithm is analytically shown
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed.
Comment: 13 figures, 35 references
New efficient algorithms for multiple change-point detection with kernels
Several statistical approaches based on reproducing kernels have been
proposed to detect abrupt changes arising in the full distribution of the
observations and not only in the mean or variance. Some of these approaches
enjoy good statistical properties (oracle inequality, ...). Nonetheless,
they have a high computational cost both in terms of time and memory. This
makes their application difficult even for small and medium sample sizes (). This computational issue is addressed by first describing a new
efficient and exact algorithm for kernel multiple change-point detection with
an improved worst-case complexity that is quadratic in time and linear in
space. It allows dealing with medium size signals (up to ).
Second, a faster but approximate algorithm is described. It is based on a
low-rank approximation to the Gram matrix. It is linear in time and space. This
approximation algorithm can be applied to large-scale signals ().
These exact and approximate algorithms have been implemented in R
and C for various kernels. The computational and statistical
performances of these new algorithms have been assessed through empirical
experiments. The runtime of the new algorithms is observed to be faster than
that of other considered procedures. Finally, simulations confirmed the higher
statistical accuracy of kernel-based approaches to detect changes that are not
only in the mean. These simulations also illustrate the flexibility of
kernel-based approaches to analyze complex biological profiles made of DNA copy
number and allele B frequencies. An R package implementing the approach will be
made available on GitHub.
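The kind of kernel change-point search described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a Gaussian kernel, stores the full Gram matrix (quadratic memory, unlike the linear-space method the abstract reports), and uses a plain dynamic program over a fixed number of segments. The names `gaussian_gram`, `kernel_change_points`, and `bandwidth` are hypothetical.

```python
import numpy as np


def gaussian_gram(x, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel (assumed kernel choice)."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth ** 2))


def kernel_change_points(x, n_segments, bandwidth=1.0):
    """Return the change points that minimise the kernel least-squares cost.

    Illustrative only: keeps the whole Gram matrix in memory and runs a
    textbook dynamic program over segment boundaries.
    """
    K = gaussian_gram(x, bandwidth)
    n = K.shape[0]
    # 2D prefix sums so that any block sum of K costs O(1).
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)
    diag = np.concatenate(([0.0], np.cumsum(np.diag(K))))

    def cost(s, e):
        # Within-segment scatter in feature space for observations s..e-1.
        block = S[e, e] - S[s, e] - S[e, s] + S[s, s]
        return (diag[e] - diag[s]) - block / (e - s)

    best = np.full((n_segments + 1, n + 1), np.inf)
    arg = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, n_segments + 1):
        for e in range(d, n + 1):
            for s in range(d - 1, e):
                c = best[d - 1, s] + cost(s, e)
                if c < best[d, e]:
                    best[d, e] = c
                    arg[d, e] = s

    # Backtrack: the start of each segment after the first is a change point.
    change_points = []
    e = n
    for d in range(n_segments, 1, -1):
        e = int(arg[d, e])
        change_points.append(e)
    return sorted(change_points)
```

For example, `kernel_change_points([0.0] * 50 + [5.0] * 50, 2)` searches for a single change point in a piecewise-constant signal. Because the cost depends on the data only through the kernel, the same routine can react to changes in variance or shape, not only in the mean.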
Solving Hard Computational Problems Efficiently: Asymptotic Parametric Complexity 3-Coloring Algorithm
Many practical problems in almost all scientific and technological
disciplines have been classified as computationally hard (NP-hard or even
NP-complete). In life sciences, combinatorial optimization problems frequently
arise in molecular biology, e.g., genome sequencing; global alignment of
multiple genomes; identifying siblings; or discovering dysregulated pathways. In
almost all of these problems, one needs to prove a hypothesis about a
certain property of an object that is present only when the object adopts some
particular admissible structure (an NP-certificate) and absent otherwise (no
admissible structure). However, none of the standard approaches can discard the
hypothesis when no solution can be found, since none can provide a proof that
there is no admissible structure. This article presents an algorithm that introduces a
novel type of solution method to "efficiently" solve the graph 3-coloring
problem, which is NP-complete. The proposed method provides certificates
(proofs) in both cases: present or absent, so it is possible to accept or
reject the hypothesis on the basis of a rigorous proof. It provides exact
solutions and is polynomial-time (i.e., efficient), although parametric. The only
requirement is sufficient computational power, which is controlled by the
parameter . Nevertheless, here it is proved that the
probability of requiring a value of to obtain a solution for a
random graph decreases exponentially: , making
tractable almost all problem instances. Thorough experimental analyses were
performed. The algorithm was tested on random graphs, planar graphs and
4-regular planar graphs. The obtained experimental results are in accordance
with the theoretical expected results.
Comment: Working paper
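To make the notion of an NP-certificate in this abstract concrete, the sketch below checks the "present" case only: given a proposed coloring, it verifies in time linear in the number of edges that the coloring is a valid 3-coloring. It does not reproduce the paper's algorithm or its certificate of absence; the function name `is_valid_3_coloring` and the edge-list input format are assumptions made for illustration.

```python
def is_valid_3_coloring(edges, coloring):
    """Check a candidate certificate for graph 3-coloring.

    edges: iterable of (u, v) pairs.
    coloring: dict mapping each vertex to a color in {0, 1, 2}.
    """
    allowed = (0, 1, 2)
    for u, v in edges:
        cu = coloring.get(u)
        cv = coloring.get(v)
        # Every endpoint must be colored, and adjacent vertices must differ.
        if cu not in allowed or cv not in allowed or cu == cv:
            return False
    return True
```

A triangle with three distinct colors passes the check, while any assignment that gives two adjacent vertices the same color fails. Verifying such a certificate is easy; the difficulty the abstract addresses lies in finding one, or proving that none exists.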
Overcommitment in Cloud Services -- Bin packing with Chance Constraints
This paper considers a traditional resource allocation problem: scheduling
jobs on machines. One recent application is cloud computing, where jobs
arrive in an online fashion with capacity requirements and need to be
immediately scheduled on physical machines in data centers. It is often
observed that the requested capacities are not fully utilized, hence offering
an opportunity to employ an overcommitment policy, i.e., selling resources
beyond capacity. Setting the right overcommitment level can induce a
significant cost reduction for the cloud provider, while only inducing a very
low risk of violating capacity constraints. We introduce and study a model that
quantifies the value of overcommitment by modeling the problem as bin packing
with chance constraints. We then propose an alternative formulation that
transforms each chance constraint into a submodular function. We show that our
model captures the risk pooling effect and can guide scheduling and
overcommitment decisions. We also develop a family of online algorithms that
are intuitive, easy to implement, and provide a constant-factor guarantee
relative to the optimum. Finally, we calibrate our model using realistic workload data, and
test our approach in a practical setting. Our analysis and experiments
illustrate the benefit of overcommitment in cloud services, and suggest a cost
reduction of 1.5% to 17% depending on the provider's risk tolerance.
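The idea of packing jobs under a chance constraint can be sketched as follows. This is a minimal illustration, not the paper's formulation or algorithm: it assumes independent job usages with known mean and standard deviation, replaces the chance constraint by a Gaussian approximation (total mean plus a quantile times the pooled standard deviation must not exceed capacity), and uses a simple online first-fit rule. The names `first_fit_overcommit` and `risk` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def first_fit_overcommit(jobs, capacity, risk=0.01):
    """Online first-fit under an approximate chance constraint.

    jobs: sequence of (mean, std) usage pairs, processed in arrival order.
    A job fits on a machine if
        sum(mean) + z * sqrt(sum(std**2)) <= capacity,
    where z is the standard normal quantile at level 1 - risk
    (Gaussian approximation, an assumption of this sketch).
    Assumes each job fits on an empty machine by itself.
    """
    z = NormalDist().inv_cdf(1.0 - risk)
    machines = []    # per machine: [total mean, total variance]
    assignment = []  # machine index chosen for each job
    for mean, std in jobs:
        for idx, (m, v) in enumerate(machines):
            if m + mean + z * sqrt(v + std ** 2) <= capacity:
                machines[idx] = [m + mean, v + std ** 2]
                assignment.append(idx)
                break
        else:
            # No open machine can take the job: open a new one.
            machines.append([mean, std ** 2])
            assignment.append(len(machines) - 1)
    return assignment, len(machines)
```

Because the safety margin grows with the square root of the pooled variance rather than linearly in the number of jobs, a machine can host more jobs than it could if each job reserved its own individual buffer. This is the risk pooling effect the abstract refers to.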
Data-driven linear decision rule approach for distributionally robust optimization of on-line signal control
We propose a two-stage, on-line signal control strategy for dynamic networks using a linear decision rule (LDR) approach and a distributionally robust optimization (DRO) technique. The first (off-line) stage formulates an LDR that maps real-time traffic data to optimal signal control policies. A DRO problem is solved to optimize the on-line performance of the LDR in the presence of uncertainties associated with the observed traffic states and ambiguity in their underlying distribution functions. We employ a data-driven calibration of the uncertainty set, which takes into account historical traffic data. The second (on-line) stage implements a very efficient linear decision rule whose performance is guaranteed by the off-line computation. We test the proposed signal control procedure in a simulation environment that is informed by actual traffic data obtained in Glasgow, and demonstrate its full potential in on-line operation and deployability on realistic networks, as well as its effectiveness in improving traffic