10,855 research outputs found
Mondrian Forests for Large-Scale Regression when Uncertainty Matters
Many real-world regression problems demand a measure of the uncertainty
associated with each prediction. Standard decision forests deliver efficient
state-of-the-art predictive performance, but high-quality uncertainty estimates
are lacking. Gaussian processes (GPs) deliver uncertainty estimates, but
scaling GPs to large-scale data sets comes at the cost of approximating the
uncertainty estimates. We extend Mondrian forests, first proposed by
Lakshminarayanan et al. (2014) for classification problems, to the large-scale
non-parametric regression setting. Using a novel hierarchical Gaussian prior
that dovetails with the Mondrian forest framework, we obtain principled
uncertainty estimates, while still retaining the computational advantages of
decision forests. Through a combination of illustrative examples, real-world
large-scale datasets, and Bayesian optimization benchmarks, we demonstrate that
Mondrian forests outperform approximate GPs on large-scale regression tasks and
deliver better-calibrated uncertainty assessments than decision-forest-based
methods.
Comment: Proceedings of the 19th International Conference on Artificial
Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume
5
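The abstract does not say how per-tree predictions are turned into a single uncertainty estimate. A minimal sketch, assuming each tree returns a Gaussian predictive mean and variance and that the forest is treated as an equally weighted mixture of those Gaussians (an assumption for illustration, not a statement of the authors' method); the function name and the example numbers are hypothetical:

```python
from statistics import fmean


def combine_tree_predictions(means, variances):
    """Moment-match an equally weighted mixture of per-tree Gaussians.

    Assumption: every tree reports a predictive mean and variance for the
    same test point. The abstract does not specify this interface.
    """
    if not means or len(means) != len(variances):
        raise ValueError("need one (mean, variance) pair per tree")
    mean = fmean(means)
    # Law of total variance: average within-tree variance plus the
    # spread of the tree means around the forest mean.
    second_moment = fmean(v + m * m for m, v in zip(means, variances))
    variance = max(second_moment - mean * mean, 0.0)
    return mean, variance


# Hypothetical outputs of three trees for a single test point.
mean, variance = combine_tree_predictions([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Disagreement between trees widens the combined variance, which is one way a forest can report higher uncertainty away from the training data.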
A machine learning based framework to identify and classify long terminal repeat retrotransposons
Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-LEARNER, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a particular type of TE characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: REPEATMASKER, CENSOR and LTRDIGEST. In contrast to these methods, TE-LEARNER is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance, while being able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-LEARNER's predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE
On Machine-Learned Classification of Variable Stars with Sparse and Noisy Time-Series Data
With the coming data deluge from synoptic surveys, there is a growing need
for frameworks that can quickly and automatically produce calibrated
classification probabilities for newly-observed variables based on a small
number of time-series measurements. In this paper, we introduce a methodology
for variable-star classification, drawing from modern machine-learning
techniques. We describe how to homogenize the information gleaned from light
curves by selection and computation of real-numbered metrics ("features"),
detail methods to robustly estimate periodic light-curve features, introduce
tree-ensemble methods for accurate variable star classification, and show how
to rigorously evaluate the classification results using cross validation. On a
25-class data set of 1542 well-studied variable stars, we achieve a 22.8%
overall classification error using the random forest classifier; this
represents a 24% improvement over the best previous classifier on these data.
This methodology is effective for identifying samples of specific science
classes: for pulsational variables used in Milky Way tomography we obtain a
discovery efficiency of 98.2% and for eclipsing systems we find an efficiency
of 99.1%, both at 95% purity. We show that the random forest (RF) classifier is
superior to other machine-learned methods in terms of accuracy, speed, and
relative immunity to features with no useful class information; the RF
classifier can also be used to estimate the importance of each feature in
classification. Additionally, we present the first astronomical use of
hierarchical classification methods to incorporate a known class taxonomy in
the classifier, which further reduces the catastrophic error rate to 7.8%.
Excluding low-amplitude sources, our overall error rate improves to 14%, with a
catastrophic error rate of 3.5%.
Comment: 23 pages, 9 figure
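The abstract reports discovery efficiency at a fixed purity without defining either term. A minimal sketch, assuming efficiency means the fraction of true members of a class that are recovered (recall) and purity means the fraction of selected objects that truly belong to the class (precision); the class names and labels below are made up for illustration:

```python
def efficiency_and_purity(true_labels, predicted_labels, target):
    """Return (efficiency, purity) for one target class.

    Assumption: efficiency = TP / (TP + FN), purity = TP / (TP + FP).
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    efficiency = tp / (tp + fn) if tp + fn else 0.0
    purity = tp / (tp + fp) if tp + fp else 0.0
    return efficiency, purity


# Hypothetical labels for six stars.
truth = ["rr_lyrae", "rr_lyrae", "rr_lyrae", "eclipsing", "eclipsing", "other"]
guess = ["rr_lyrae", "rr_lyrae", "eclipsing", "eclipsing", "rr_lyrae", "other"]
efficiency, purity = efficiency_and_purity(truth, guess, "rr_lyrae")
```

With a probabilistic classifier, the decision threshold would typically be tuned until purity reaches the desired level, and efficiency is then read off at that threshold.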
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involves massive data, but it also often includes online data and data
heterogeneity. Recently some statistical methods have been adapted to process
Big Data, like linear regression models, clustering methods and bootstrapping
schemes. Based on decision trees combined with aggregation and bootstrap ideas,
random forests were introduced by Breiman in 2001. They are a powerful
nonparametric statistical method that handles regression problems, as well as
two-class and multi-class classification problems, in a single versatile
framework. Focusing on classification problems, this paper
proposes a selective review of available proposals that deal with scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as out-of-bag error and variable importance -- are
addressed in these methods. Then, we formulate various remarks for random
forests in the Big Data context. Finally, we experiment with five variants on two
massive datasets (15 and 120 million observations), a simulated one as well
as real world data. One variant relies on subsampling while three others are
related to parallel implementations of random forests and involve either
various adaptations of bootstrap to Big Data or "divide-and-conquer"
approaches. The fifth variant relates to online learning of random forests.
These numerical experiments highlight the relative performance of the
different variants, as well as some of their limitations
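The out-of-bag error mentioned above depends on which observations each tree's bootstrap sample leaves out. A minimal sketch of the standard bootstrap split for a single tree (not any of the Big Data variants the paper reviews); the function name is hypothetical:

```python
import random


def bootstrap_split(n_observations, rng):
    """Draw n indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_observations) for _ in range(n_observations)]
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(n_observations) if i not in in_bag_set]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_split(10_000, rng)
oob_fraction = len(out_of_bag) / 10_000
```

For large n the out-of-bag fraction approaches (1 - 1/n)^n, roughly 36.8%. This is why subsampling or distributed bootstrap schemes need extra care: if each tree sees only a small part of the data, the out-of-bag set is no longer the same size or as easy to track.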
Random forests with random projections of the output space for high dimensional multi-label classification
We adapt the idea of random projections applied to the output space, so as to
enhance tree-based ensemble methods in the context of multi-label
classification. We show how learning time complexity can be reduced without
affecting computational complexity and accuracy of predictions. We also show
that random output space projections may be used in order to reach different
bias-variance tradeoffs, over a broad panel of benchmark problems, and that
this may lead to improved accuracy while significantly reducing the
computational burden of the learning stage
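The abstract does not state which projection is used. A minimal sketch, assuming a dense Gaussian random projection of binary label vectors into a lower-dimensional output space; the trees would then be grown on the projected targets, which is not shown here. All names and sizes are hypothetical:

```python
import math
import random


def gaussian_projection_matrix(n_labels, n_components, rng):
    """Random matrix with N(0, 1/n_components) entries (an assumed choice)."""
    scale = 1.0 / math.sqrt(n_components)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_labels)]
        for _ in range(n_components)
    ]


def project_labels(label_rows, projection):
    """Map each label vector to its lower-dimensional projection."""
    return [
        [sum(w * y for w, y in zip(weights, labels)) for weights in projection]
        for labels in label_rows
    ]


rng = random.Random(0)
# Three samples with eight binary labels each, projected to three dimensions.
labels = [
    [1, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
projection = gaussian_projection_matrix(n_labels=8, n_components=3, rng=rng)
projected = project_labels(labels, projection)
```

Split scores computed on three projected outputs instead of eight labels are cheaper to evaluate, which is the kind of learning-time saving the abstract refers to.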
Deep representation learning for human motion prediction and classification
Generative models of 3D human motion are often restricted to a small number
of activities and can therefore not generalize well to novel movements or
applications. In this work we propose a deep learning framework for human
motion capture data that learns a generic representation from a large corpus of
motion capture data and generalizes well to new, unseen, motions. Using an
encoding-decoding network that learns to predict future 3D poses from the most
recent past, we extract a feature representation of human motion. Most work on
deep learning for sequence prediction focuses on video and speech. Since
skeletal data has a different structure, we present and evaluate different
network architectures that make different assumptions about time dependencies
and limb correlations. To quantify the learned features, we use the output of
different layers for action classification and visualize the receptive fields
of the network units. Our method outperforms the recent state of the art in
skeletal motion prediction even though those methods use action-specific training data.
Our results show that deep feedforward networks, trained from a generic mocap
database, can successfully be used for feature extraction from human motion
data and that this representation can be used as a foundation for
classification and prediction.
Comment: This paper is published at the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 201
Brain Tumor Segmentation with Deep Neural Networks
In this paper, we present a fully automatic brain tumor segmentation method
based on Deep Neural Networks (DNNs). The proposed networks are tailored to
glioblastomas (both low and high grade) pictured in MR images. By their very
nature, these tumors can appear anywhere in the brain and have almost any kind
of shape, size, and contrast. These reasons motivate our exploration of a
machine learning solution that exploits a flexible, high capacity DNN while
being extremely efficient. Here, we give a description of different model
choices that we've found to be necessary for obtaining competitive performance.
We explore in particular different architectures based on Convolutional Neural
Networks (CNN), i.e. DNNs specifically adapted to image data.
We present a novel CNN architecture which differs from those traditionally
used in computer vision. Our CNN exploits both local features as well as more
global contextual features simultaneously. Also, different from most
traditional uses of CNNs, our networks use a final layer that is a
convolutional implementation of a fully connected layer, which allows a 40-fold
speedup. We also describe a 2-phase training procedure that allows us to
tackle difficulties related to the imbalance of tumor labels. Finally, we
explore a cascade architecture in which the output of a basic CNN is treated as
an additional source of information for a subsequent CNN. Results reported on
the 2013 BRATS test dataset reveal that our architecture improves over the
currently published state-of-the-art while being over 30 times faster
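The idea of a convolutional implementation of a fully connected layer can be illustrated in one dimension. A minimal sketch, assuming a single dense unit whose weights span one patch: applying it patch by patch gives the same outputs as one sliding-window pass over the whole signal. It shows only the output equivalence, not the speedup reported in the abstract, and all values are made up:

```python
def dense(patch, weights, bias):
    """One fully connected unit applied to a single patch."""
    return sum(w * x for w, x in zip(weights, patch)) + bias


def conv1d_valid(signal, kernel, bias):
    """Sliding-window pass without padding or kernel flip (CNN convention)."""
    width = len(kernel)
    outputs = []
    for start in range(len(signal) - width + 1):
        acc = bias
        for offset in range(width):
            acc += kernel[offset] * signal[start + offset]
        outputs.append(acc)
    return outputs


signal = [0.5, -1.0, 2.0, 0.0, 1.5, 3.0]
weights = [0.2, -0.4, 0.1]
bias = 0.05

# Patch-by-patch dense evaluation versus a single convolutional pass.
patchwise = [dense(signal[i:i + 3], weights, bias) for i in range(len(signal) - 2)]
convolved = conv1d_valid(signal, weights, bias)
```

In a multi-layer network the convolutional form lets neighbouring patches share intermediate feature maps instead of recomputing them, which is where savings at prediction time would come from.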