Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involves massive data, but it also often includes online data and data
heterogeneity. Recently, some statistical methods have been adapted to process
Big Data, such as linear regression models, clustering methods, and bootstrapping
schemes. Random forests, introduced by Breiman in 2001, are based on decision
trees combined with aggregation and bootstrap ideas. They are a powerful
nonparametric statistical method that handles, in a single and versatile
framework, regression problems as well as two-class and multi-class
classification problems. Focusing on classification problems, this paper
offers a selective review of available proposals for scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as the out-of-bag error and variable importance -- are
handled in these methods. We then formulate several remarks on random
forests in the Big Data context. Finally, we experiment with five variants on two
massive datasets (15 and 120 million observations), one simulated and one from
real-world data. One variant relies on subsampling, while three others correspond
to parallel implementations of random forests and involve either
adaptations of the bootstrap to Big Data or "divide-and-conquer"
approaches. The fifth variant relies on online learning of random forests.
These numerical experiments highlight the relative performance of the
different variants, as well as some of their limitations.
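The subsampling variant mentioned above can be sketched in a few lines: each tree (reduced here to a decision stump for brevity) is grown on a small subsample of the full dataset, and predictions are aggregated by majority vote. This is a toy illustration of the subsampling-plus-aggregation idea, not any specific implementation evaluated in the paper; all function names are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    # exhaustive search for the best axis-aligned split (a depth-1 tree)
    best, best_err = (0, 0.0, 0, 1), np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            for ll, rl in ((0, 1), (1, 0)):
                err = (left != ll).sum() + (right != rl).sum()
                if err < best_err:
                    best_err, best = err, (j, t, ll, rl)
    return best

def predict_stump(stump, X):
    j, t, ll, rl = stump
    return np.where(X[:, j] <= t, ll, rl)

def subsampled_forest(X, y, n_trees=25, m=100):
    # each "tree" sees only a subsample of size m (m << n for Big Data),
    # drawn with replacement, in the spirit of subsampled bootstrap schemes
    n = len(y)
    return [fit_stump(X[rng.integers(0, n, size=m)],
                      y[rng.integers(0, n, size=m)]) or
            fit_stump(X[(idx := rng.integers(0, n, size=m))], y[idx])
            for _ in range(n_trees)]

def subsampled_forest(X, y, n_trees=25, m=100):
    n = len(y)
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=m)  # subsample with replacement
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def predict_forest(stumps, X):
    # aggregation step: majority vote over the individual stumps
    votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
    return (votes > 0.5).astype(int)

# toy data: the label is 1 exactly when the first feature is positive
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)
stumps = subsampled_forest(X, y, n_trees=25, m=100)
acc = (predict_forest(stumps, X) == y).mean()
```

Each stump trains on 100 of the 400 observations, so no single learner ever touches the full dataset; the ensemble vote recovers the decision boundary.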
Distributed Robust Learning
We propose a framework for distributed robust statistical learning on {\em
big contaminated data}. The Distributed Robust Learning (DRL) framework can
reduce the computational time of traditional robust learning methods by several
orders of magnitude. We analyze the robustness property of DRL, showing that
DRL not only preserves the robustness of the base robust learning method, but
also tolerates contaminations on a constant fraction of results from computing
nodes (node failures). More precisely, even in the presence of the most
adversarial outlier distribution over computing nodes, DRL still achieves a
breakdown point that is at least a constant fraction of that of the
corresponding centralized algorithm. This stands in stark contrast with a naive
division-and-averaging implementation, which may reduce the breakdown point by
a factor that grows with the number of computing nodes used. We then specialize the
DRL framework for two concrete cases: distributed robust principal component
analysis and distributed robust regression. We demonstrate the efficiency and
the robustness advantages of DRL through comprehensive simulations and
predicting image tags on a large-scale image set.
Comment: 18 pages, 2 figures
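The robust-aggregation idea behind DRL can be illustrated on a toy distributed robust mean estimation problem: each node runs a robust base estimator on its shard, and the master combines the node results with a coordinate-wise median, which tolerates a constant fraction of arbitrarily corrupted nodes. This is a hedged sketch of the general principle, not the paper's algorithm; all names are invented here.

```python
import numpy as np

rng = np.random.default_rng(1)

def node_estimate(shard):
    # base robust learner on each node: a coordinate-wise median
    return np.median(shard, axis=0)

def robust_aggregate(node_estimates):
    # coordinate-wise median across nodes: a constant fraction of
    # adversarial node results cannot move it arbitrarily far
    return np.median(np.stack(node_estimates), axis=0)

# toy data: true mean is (2, -1); 2 of 9 nodes return adversarial garbage
true_mean = np.array([2.0, -1.0])
shards = [true_mean + rng.normal(size=(200, 2)) for _ in range(9)]
estimates = [node_estimate(s) for s in shards]
estimates[0] = np.array([1e6, 1e6])   # simulated node failures
estimates[1] = np.array([1e6, -1e6])

naive = np.mean(np.stack(estimates), axis=0)   # averaging breaks down
robust = robust_aggregate(estimates)
```

With two corrupted nodes out of nine, naive averaging is dragged off by hundreds of thousands, while the median aggregate stays near the true mean.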
Online Updating of Statistical Inference in the Big Data Setting
We present statistical methods for big data arising from online analytical
processing, where large amounts of data arrive in streams and require fast
analysis without storage/access to the historical data. In particular, we
develop iterative estimating algorithms and statistical inferences for linear
models and estimating equations that update as new data arrive. These
algorithms are computationally efficient, minimally storage-intensive, and
allow for possible rank deficiencies in the subset design matrices due to
rare-event covariates. Within the linear model setting, the proposed
online-updating framework leads to predictive residual tests that can be used
to assess the goodness-of-fit of the hypothesized model. We also propose a new
online-updating estimator under the estimating equation setting. Theoretical
properties of the goodness-of-fit tests and proposed estimators are examined in
detail. In simulation studies and real data applications, our estimator
compares favorably with competing approaches under the estimating equation
setting.
Comment: Submitted to Technometrics
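For the linear model, online updating without storing historical data can be sketched by accumulating the sufficient statistics X'X and X'y block by block as data arrive; using a pseudoinverse also accommodates rank deficiency in early blocks, as the abstract mentions. The class below is a toy illustration of the idea (not the paper's exact algorithm), with invented names.

```python
import numpy as np

rng = np.random.default_rng(2)

class OnlineOLS:
    """Exact online updating of least squares: only the p-by-p matrix X'X
    and the p-vector X'y are kept, so arriving data blocks can be
    discarded after each update."""

    def __init__(self, p):
        self.XtX = np.zeros((p, p))
        self.Xty = np.zeros(p)

    def update(self, X_block, y_block):
        # accumulate sufficient statistics from the new block
        self.XtX += X_block.T @ X_block
        self.Xty += X_block.T @ y_block

    def coef(self):
        # pinv tolerates rank-deficient X'X before enough data arrive
        return np.linalg.pinv(self.XtX) @ self.Xty

# stream 50 blocks of 100 observations from a known linear model
beta = np.array([1.5, -2.0, 0.5])
model = OnlineOLS(p=3)
for _ in range(50):
    Xb = rng.normal(size=(100, 3))
    model.update(Xb, Xb @ beta + 0.1 * rng.normal(size=100))
est = model.coef()
```

After all blocks are processed, `est` matches the full-data OLS fit exactly, since the sufficient statistics are identical to those of a single batch fit.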
Distributed Low-rank Subspace Segmentation
Vision problems ranging from image clustering to motion segmentation to
semi-supervised learning can naturally be framed as subspace segmentation
problems, in which one aims to recover multiple low-dimensional subspaces from
noisy and corrupted input data. Low-Rank Representation (LRR), a convex
formulation of the subspace segmentation problem, is provably and empirically
accurate on small problems but does not scale to the massive sizes of modern
vision datasets. Moreover, past work aimed at scaling up low-rank matrix
factorization is not applicable to LRR given its non-decomposable constraints.
In this work, we propose a novel divide-and-conquer algorithm for large-scale
subspace segmentation that can cope with LRR's non-decomposable constraints and
maintains LRR's strong recovery guarantees. This has immediate implications for
the scalability of subspace segmentation, which we demonstrate on a benchmark
face recognition dataset and in simulations. We then introduce novel
applications of LRR-based subspace segmentation to large-scale semi-supervised
learning for multimedia event detection, concept detection, and image tagging.
In each case, we obtain state-of-the-art results and order-of-magnitude
speed-ups.
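As background for the LRR formulation: for clean data drawn from independent subspaces, the program min ||Z||_* subject to X = XZ has the closed-form solution Z = VV^T built from the top right singular vectors of X (the classical shape interaction matrix), and |Z| is then block-diagonal and usable as a clustering affinity. A minimal numerical sketch of that noiseless special case (toy data, no divide-and-conquer step):

```python
import numpy as np

rng = np.random.default_rng(3)

def lrr_affinity(X, rank):
    # closed-form noiseless LRR solution Z = V_r V_r^T from the top
    # right singular vectors of X; |Z| serves as a clustering affinity
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vr = Vt[:rank]
    return np.abs(Vr.T @ Vr)

# two independent 2-D subspaces of R^10, 30 points sampled from each
U1 = np.linalg.qr(rng.normal(size=(10, 2)))[0]
U2 = np.linalg.qr(rng.normal(size=(10, 2)))[0]
X = np.hstack([U1 @ rng.normal(size=(2, 30)),
               U2 @ rng.normal(size=(2, 30))])

A = lrr_affinity(X, rank=4)
within = (A[:30, :30].mean() + A[30:, 30:].mean()) / 2
cross = A[:30, 30:].mean()
```

Because the two subspaces are independent, the off-diagonal blocks of the affinity vanish (up to numerical precision), so spectral clustering on `A` would recover the segmentation; scaling this convex program to massive data is exactly the problem the divide-and-conquer algorithm addresses.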
"Divide and Conquer" Semiclassical Molecular Dynamics: A Practical Method for Spectroscopic Calculations of High-Dimensional Molecular Systems
We describe in detail our recently established "divide-and-conquer"
semiclassical method [M. Ceotto, G. Di Liberto and R. Conte, Phys. Rev. Lett.
119, 010401 (2017)] and propose a new implementation of it to increase the
accuracy of the results. The technique makes it possible to perform spectroscopic
calculations of high-dimensional systems by dividing the full-dimensional
problem into a set of lower-dimensional ones. The partition procedure,
originally based on a dynamical analysis of the Hessian matrix, is here
achieved more rigorously through a hierarchical subspace-separation criterion
based on Liouville's theorem. Comparisons of calculated vibrational frequencies
with exact quantum ones for a set of molecules including benzene show that the
new implementation performs better than the original one and that, on average,
the loss in accuracy with respect to full-dimensional semiclassical calculations
is reduced to only 10 wavenumbers. Furthermore, by investigating the challenging
Zundel cation, we also demonstrate that the "divide-and-conquer" approach
can handle complex, strongly anharmonic molecular systems. Overall, the
method greatly facilitates the assignment and physical interpretation of
experimental IR spectra by providing accurate vibrational fundamentals and
overtones decomposed into reduced-dimensionality spectra.
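The partitioning step can be illustrated with a deliberately simplified toy: group coordinates whose mutual Hessian coupling exceeds a threshold into the same subspace via connected components, so that each block can then be treated as a separate lower-dimensional problem. This is NOT the authors' Liouville-based criterion (nor their original dynamical Hessian analysis), only a minimal sketch of the general "split weakly coupled coordinates" idea; all names are invented.

```python
import numpy as np

def partition_coordinates(H, tol=1e-3):
    """Toy subspace partition: coordinates i, j with |H_ij| > tol are
    placed in the same subspace, found as connected components of the
    coupling graph (a crude stand-in for the paper's criterion)."""
    n = H.shape[0]
    coupled = np.abs(H) > tol
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = current
        stack = [i]                       # depth-first search
        while stack:
            j = stack.pop()
            for k in np.nonzero(coupled[j])[0]:
                if labels[k] < 0:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels

# block-diagonal toy Hessian: coordinates {0,1} and {2,3} are decoupled,
# so the full 4-D problem splits into two 2-D subproblems
H = np.array([[2.0, 0.3, 0.0, 0.0],
              [0.3, 1.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.2],
              [0.0, 0.0, 0.2, 0.8]])
labels = partition_coordinates(H)
```

Here the two 2-by-2 blocks are recovered as separate subspaces; in the actual method the subspaces are identified far more carefully, since the accuracy of the reduced-dimensionality spectra depends on how the full problem is split.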