Non-convex Global Minimization and False Discovery Rate Control for the TREX

Authors: Bien Jacob; Gaynanova Irina; Lederer Johannes; Müller Christian
Publication venue: 'Informa UK Limited'
Publication date: 20/09/2016

The TREX is a recently introduced method for performing sparse high-dimensional regression. Despite its statistical promise as an alternative to the lasso, square-root lasso, and scaled lasso, the TREX is computationally challenging in that it requires solving a non-convex optimization problem. This paper shows a remarkable result: despite the non-convexity of the TREX problem, there exists a polynomial-time algorithm that is guaranteed to find the global minimum. This result adds the TREX to a very short list of non-convex optimization problems that can be globally optimized (principal components analysis being a famous example). After deriving and developing this new approach, we demonstrate that (i) the ability of the preexisting TREX heuristic to reach the global minimum is strongly dependent on the difficulty of the underlying statistical problem, (ii) the new polynomial-time algorithm for TREX permits a novel variable ranking and selection scheme, and (iii) this scheme can be incorporated into a rule that controls the false discovery rate (FDR) of included features in the model. To achieve this last aim, we provide an extension of the results of Barber & Candès (2015) to establish that the knockoff filter framework can be applied to the TREX. This investigation thus provides both a rare case study of a heuristic for non-convex optimization and a novel way of exploiting non-convexity for statistical inference.
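
The FDR-controlling rule mentioned in the abstract builds on the knockoff filter of Barber & Candès (2015). The following is a minimal sketch of that filter's data-dependent threshold only. It assumes the feature statistics W_j have already been computed (the abstract does not say how the TREX produces them), and the function names `knockoff_threshold` and `select_features` are illustrative, not taken from the paper.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Threshold of the knockoff filter for given feature statistics W.

    W      : array of statistics W_j; a large positive W_j is evidence that
             feature j is a true signal (assumed to be computed elsewhere).
    q      : target false discovery rate.
    offset : 1 gives the conservative "knockoff+" rule, 0 the plain rule.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes |W_j|.
    candidates = np.unique(np.abs(W[W != 0]))  # np.unique returns sorted values
    for t in candidates:
        negatives = np.sum(W <= -t)           # estimate of false discoveries
        positives = max(np.sum(W >= t), 1)    # number of selections at level t
        if (offset + negatives) / positives <= q:
            return float(t)
    return float("inf")                        # no threshold meets the target


def select_features(W, q=0.1):
    """Indices of the features whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_threshold(W, q))
```

Returning the smallest threshold whose estimated false discovery proportion is at most q keeps as many features as the rule allows. When no threshold qualifies, the sketch selects nothing.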

arXiv.org e-Print Archive

FigShare

Algorithmic and Statistical Perspectives on Large-Scale Data Analysis

Author: Mahoney Michael W.
Publication date: 08/10/2010

In recent years, ideas from statistics and scientific computing have begun to interact in increasingly sophisticated and fruitful ways with ideas from computer science and the theory of algorithms to aid in the development of improved worst-case algorithms that are useful for large-scale scientific and Internet data analysis problems. In this chapter, I will describe two recent examples: one having to do with selecting good columns or features from a (DNA Single Nucleotide Polymorphism) data matrix, and the other having to do with selecting good clusters or communities from a data graph (representing a social or information network). Both drew on ideas from the two areas and may serve as a model for exploiting complementary algorithmic and statistical perspectives in order to solve applied large-scale data analysis problems. Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors, "Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201

arXiv.org e-Print Archive

CiteSeerX

Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs

Author: Korenblum Daniel
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2018

Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provable optimal recovery using the algorithm is analytically shown for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed. Comment: 13 figures, 35 references

arXiv.org e-Print Archive

Directory of Open Access Journals

New efficient algorithms for multiple change-point detection with kernels

Authors: Celisse Alain; Marot Guillemette; Pierre-Jean Morgane; Rigaill Guillem
Publication date: 01/09/2016

Several statistical approaches based on reproducing kernels have been proposed to detect abrupt changes arising in the full distribution of the observations and not only in the mean or variance. Some of these approaches enjoy good statistical properties (oracle inequality, ...). Nonetheless, they have a high computational cost both in terms of time and memory. This makes their application difficult even for small and medium sample sizes ($n < 10^4$). This computational issue is addressed by first describing a new efficient and exact algorithm for kernel multiple change-point detection with an improved worst-case complexity that is quadratic in time and linear in space. It allows dealing with medium size signals (up to $n \approx 10^5$). Second, a faster but approximate algorithm is described. It is based on a low-rank approximation to the Gram matrix. It is linear in time and space. This approximation algorithm can be applied to large-scale signals ($n \geq 10^6$). These exact and approximation algorithms have been implemented in R and C for various kernels. The computational and statistical performances of these new algorithms have been assessed through empirical experiments. The runtime of the new algorithms is observed to be faster than that of other considered procedures. Finally, simulations confirmed the higher statistical accuracy of kernel-based approaches to detect changes that are not only in the mean. These simulations also illustrate the flexibility of kernel-based approaches to analyze complex biological profiles made of DNA copy number and allele B frequencies. An R package implementing the approach will be made available on GitHub.
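
For orientation, here is a minimal sketch of the textbook dynamic program for kernel change-point detection. It is not the algorithm of the paper: it stores the full Gram matrix and therefore needs quadratic memory, which is precisely the cost the abstract says the new exact algorithm avoids. The Gaussian kernel, the `bandwidth` parameter and the function names are assumptions made for illustration. A segment [s, e) is scored by the sum of k(x_i, x_i) over the segment minus the sum of all k(x_i, x_j) within the segment divided by its length.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel (illustrative kernel choice)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_changepoints(x, n_segments, bandwidth=1.0):
    """Best segmentation into n_segments pieces; returns the change points."""
    K = gaussian_gram(x, bandwidth)
    n = K.shape[0]
    # S[a, b] holds the sum of K[:a, :b], so any block sum costs O(1).
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)
    diag = np.concatenate(([0.0], np.cumsum(np.diag(K))))

    def cost(s, e):
        block = S[e, e] - S[s, e] - S[e, s] + S[s, s]
        return (diag[e] - diag[s]) - block / (e - s)

    # best[d, e]: minimal cost of splitting the first e points into d segments.
    best = np.full((n_segments + 1, n + 1), np.inf)
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, n_segments + 1):
        for e in range(d, n + 1):
            for s in range(d - 1, e):
                c = best[d - 1, s] + cost(s, e)
                if c < best[d, e]:
                    best[d, e] = c
                    back[d, e] = s

    # Walk back from the end to recover the start of each later segment.
    changepoints, e = [], n
    for d in range(n_segments, 1, -1):
        e = int(back[d, e])
        changepoints.append(e)
    return sorted(changepoints)
```

For example, `kernel_changepoints([0.0] * 20 + [5.0] * 20, n_segments=2)` places the single change point at index 20. The triple loop is quadratic in n for each number of segments, so this sketch is only practical for short signals.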

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Hal-Diderot

Solving Hard Computational Problems Efficiently: Asymptotic Parametric Complexity 3-Coloring Algorithm

Author: Martin H. Jose Antonio
Publication date: 22/09/2012

Many practical problems in almost all scientific and technological disciplines have been classified as computationally hard (NP-hard or even NP-complete). In life sciences, combinatorial optimization problems frequently arise in molecular biology, e.g., genome sequencing; global alignment of multiple genomes; identifying siblings or discovery of dysregulated pathways. In almost all of these problems, there is the need for proving a hypothesis about a certain property of an object that can be present only when it adopts some particular admissible structure (an NP-certificate) or be absent (no admissible structure). However, none of the standard approaches can discard the hypothesis when no solution can be found, since none can provide a proof that there is no admissible structure. This article presents an algorithm that introduces a novel type of solution method to "efficiently" solve the graph 3-coloring problem, an NP-complete problem. The proposed method provides certificates (proofs) in both cases, present or absent, so it is possible to accept or reject the hypothesis on the basis of a rigorous proof. It provides exact solutions and is polynomial-time (i.e., efficient), albeit parametric. The only requirement is sufficient computational power, which is controlled by the parameter $\alpha\in\mathbb{N}$. Nevertheless, here it is proved that the probability of requiring a value of $\alpha>k$ to obtain a solution for a random graph decreases exponentially: $P(\alpha>k) \leq 2^{-(k+1)}$, making tractable almost all problem instances. Thorough experimental analyses were performed. The algorithm was tested on random graphs, planar graphs and 4-regular planar graphs. The obtained experimental results are in accordance with the theoretical expected results. Comment: Working paper
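
To make the stated bound concrete, the following short calculation takes the inequality exactly as given in the abstract and evaluates its complement for two values of k chosen here purely for illustration.

```latex
P(\alpha \le k) \;=\; 1 - P(\alpha > k) \;\ge\; 1 - 2^{-(k+1)}
\qquad\Longrightarrow\qquad
\begin{cases}
k = 3: & P(\alpha \le 3) \ge 1 - 2^{-4} = \tfrac{15}{16} = 0.9375,\\[2pt]
k = 9: & P(\alpha \le 9) \ge 1 - 2^{-10} = \tfrac{1023}{1024} \approx 0.9990.
\end{cases}
```

Under this bound, each unit increase in k halves the maximum fraction of random instances that remain unsolved.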

arXiv.org e-Print Archive

Directory of Open Access Journals

Overcommitment in Cloud Services -- Bin packing with Chance Constraints

Authors: Cohen Maxime C.; Keller Philipp W.; Mirrokni Vahab; Zadimoghaddam Morteza
Publication date: 25/05/2017

This paper considers a traditional problem of resource allocation: scheduling jobs on machines. One recent application is cloud computing, where jobs arrive in an online fashion with capacity requirements and need to be immediately scheduled on physical machines in data centers. It is often observed that the requested capacities are not fully utilized, hence offering an opportunity to employ an overcommitment policy, i.e., selling resources beyond capacity. Setting the right overcommitment level can induce a significant cost reduction for the cloud provider, while only inducing a very low risk of violating capacity constraints. We introduce and study a model that quantifies the value of overcommitment by modeling the problem as a bin packing with chance constraints. We then propose an alternative formulation that transforms each chance constraint into a submodular function. We show that our model captures the risk pooling effect and can guide scheduling and overcommitment decisions. We also develop a family of online algorithms that are intuitive, easy to implement and provide a constant factor guarantee from optimal. Finally, we calibrate our model using realistic workload data, and test our approach in a practical setting. Our analysis and experiments illustrate the benefit of overcommitment in cloud services, and suggest a cost reduction of 1.5% to 17% depending on the provider's risk tolerance.
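
A minimal sketch of how a chance constraint can drive an online packing decision follows. It rests on an assumption the abstract does not make: job usages are independent and Gaussian with known mean and variance, so that "usage stays within capacity with probability at least alpha" becomes a closed-form test on the summed means and variances. The first-fit rule and the names `fits` and `first_fit` are illustrative, not the paper's family of algorithms.

```python
from math import sqrt
from statistics import NormalDist

def fits(machine, job, capacity, alpha):
    """Chance constraint under the Gaussian assumption.

    With independent Gaussian usages, P(total usage <= capacity) >= alpha
    holds exactly when mean + z_alpha * sqrt(variance) <= capacity.
    """
    mean = machine["mean"] + job[0]
    var = machine["var"] + job[1]
    z = NormalDist().inv_cdf(alpha)  # standard normal quantile
    return mean + z * sqrt(var) <= capacity


def first_fit(jobs, capacity, alpha=0.99):
    """Place each arriving job (mean, variance) on the first machine that fits."""
    machines = []
    for job in jobs:
        for m in machines:
            if fits(m, job, capacity, alpha):
                m["mean"] += job[0]
                m["var"] += job[1]
                m["jobs"].append(job)
                break
        else:
            # No open machine can take the job: open a new one.
            # (This sketch does not check that the job fits an empty machine.)
            machines.append({"mean": job[0], "var": job[1], "jobs": [job]})
    return machines
```

With twelve identical jobs of mean 2 and variance 1, capacity 20 and alpha = 0.99, the sketch packs six jobs per machine. Reserving each job's own 99% quantile (about 4.33) would allow only four per machine. The gap comes from the safety margin growing with the square root of the summed variances, which is one way to see the risk pooling effect the abstract refers to.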

arXiv.org e-Print Archive

Data-driven linear decision rule approach for distributionally robust optimization of on-line signal control

Authors: Friesz TL; Gayah V; Han K; Liu H; Yao T
Publication venue: 'Elsevier BV'
Publication date: 26/05/2015

We propose a two-stage, on-line signal control strategy for dynamic networks using a linear decision rule (LDR) approach and a distributionally robust optimization (DRO) technique. The first (off-line) stage formulates a LDR that maps real-time traffic data to optimal signal control policies. A DRO problem is solved to optimize the on-line performance of the LDR in the presence of uncertainties associated with the observed traffic states and ambiguity in their underlying distribution functions. We employ a data-driven calibration of the uncertainty set, which takes into account historical traffic data. The second (on-line) stage implements a very efficient linear decision rule whose performance is guaranteed by the off-line computation. We test the proposed signal control procedure in a simulation environment that is informed by actual traffic data obtained in Glasgow, and demonstrate its full potential in on-line operation and deployability on realistic networks, as well as its effectiveness in improving traffic

Spiral - Imperial College Digital Repository