The Loss Rank Principle for Model Selection
We introduce a new principle for model selection in regression and
classification. Many regression models are controlled by some smoothness or
flexibility or complexity parameter c, e.g. the number of neighbors to be
averaged over in k nearest neighbor (kNN) regression or the polynomial degree
in regression with polynomials. Let f_D^c be the (best) regressor of complexity
c on data D. A more flexible regressor can fit more data sets D' well than a more rigid one, and if something (here, a small loss) is easy to achieve, it is typically worth less. We define the loss rank of f_D^c as the number of other
(fictitious) data D' that are fitted better by f_D'^c than D is fitted by
f_D^c. We suggest selecting the model complexity c that has minimal loss rank
(LoRP). Unlike most penalized maximum likelihood variants (AIC, BIC, MDL), LoRP
only depends on the regression function and loss function. It works without a
stochastic noise model, and is directly applicable to any non-parametric
regressor, like kNN. In this paper we formalize, discuss, and motivate LoRP,
study it for specific regression problems, in particular linear ones, and
compare it to other model selection schemes.
Comment: 16 pages
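As a concrete illustration, the rank can also be approximated by sampling. Below is a minimal Monte Carlo sketch for kNN regression, assuming fictitious targets drawn uniformly over the observed range (the paper treats the linear case analytically; the function names here are ours, not the paper's):

```python
# Monte Carlo sketch of the loss rank: count how many fictitious target
# vectors y' the kNN regressor of complexity k fits better than the real y.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_loss(x, y, k):
    """Empirical squared loss of kNN fitted and evaluated on (x, y)."""
    model = KNeighborsRegressor(n_neighbors=k).fit(x, y)
    return np.mean((model.predict(x) - y) ** 2)

def loss_rank(x, y, k, n_samples=500, seed=0):
    """Estimated fraction of fictitious data sets fitted better than y."""
    rng = np.random.default_rng(seed)
    ref = knn_loss(x, y, k)
    lo, hi = y.min(), y.max()
    wins = sum(knn_loss(x, rng.uniform(lo, hi, len(y)), k) < ref
               for _ in range(n_samples))
    return wins / n_samples

# Pick the neighborhood size with minimal estimated loss rank:
x = np.linspace(0, 1, 50)[:, None]
y = np.sin(4 * x[:, 0]) + 0.2 * np.random.default_rng(1).standard_normal(50)
best_k = min(range(1, 11), key=lambda k: loss_rank(x, y, k))
print(best_k)
```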
Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science
As the field of data science continues to grow, there will be an
ever-increasing demand for tools that make machine learning accessible to
non-experts. In this paper, we introduce the concept of tree-based pipeline
optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement an open source Tree-based Pipeline
Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a
series of simulated and real-world benchmark data sets. In particular, we show
that TPOT can design machine learning pipelines that provide a significant
improvement over a basic machine learning analysis while requiring little or no input or prior knowledge from the user. We also address the tendency for TPOT
to design overly complex pipelines by integrating Pareto optimization, which
produces compact pipelines without sacrificing classification accuracy. As
such, this work represents an important step toward fully automating machine
learning pipeline design.
Comment: 8 pages, 5 figures, preprint to appear in GECCO 2016; edits from reviewer comments not yet made
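For reference, a minimal usage sketch of TPOT's scikit-learn-style interface; the dataset and parameter values here are illustrative, not taken from the paper's benchmarks:

```python
# Evolve a pipeline with TPOT's genetic programming search; its Pareto
# optimization keeps the resulting pipeline compact.
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('best_pipeline.py')  # export the winning pipeline as Python code
```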
Tracing the Evolution of Physics on the Backbone of Citation Networks
Many innovations are inspired by past ideas in a non-trivial way. Tracing these origins and identifying scientific branches is crucial for understanding where research inspiration comes from. In this paper, we use citation relations to identify the
descendant chart, i.e. the family tree of research papers. Unlike other
spanning trees which focus on cost or distance minimization, we make use of the
nature of citations and identify the most important parent for each
publication, leading to a tree-like backbone of the citation network. Measures
are introduced to validate the backbone as the descendant chart. We show that
citation backbones can effectively characterize the hierarchical and fractal structure
of scientific development, and lead to accurate classification of fields and
sub-fields.
Comment: 6 pages, 5 figures
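A minimal sketch of the backbone-extraction step follows; the abstract does not specify the parent-importance measure, so choosing the most-cited reference as the single parent is our illustrative assumption:

```python
# Extract a tree-like backbone from a citation network by keeping, for each
# paper, only its single "most important parent" (here: most-cited reference).
from collections import Counter

def citation_backbone(references):
    """references: dict mapping paper -> list of papers it cites.
    Returns dict mapping paper -> its single retained parent."""
    in_degree = Counter(c for refs in references.values() for c in refs)
    backbone = {}
    for paper, refs in references.items():
        if refs:  # one parent per paper yields a forest/backbone
            backbone[paper] = max(refs, key=lambda r: in_degree[r])
    return backbone

refs = {"C": ["A", "B"], "D": ["B"], "E": ["C", "B"]}
print(citation_backbone(refs))  # {'C': 'B', 'D': 'B', 'E': 'B'}
```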
Machine Learning for Quantum Mechanical Properties of Atoms in Molecules
We introduce machine learning models of quantum mechanical observables of
atoms in molecules. Instant out-of-sample predictions for proton and carbon
nuclear chemical shifts, atomic core level excitations, and forces on atoms
reach accuracies on par with the density functional theory reference. Locality is
exploited within non-linear regression via local atom-centered coordinate
systems. The approach is validated on a diverse set of 9k small organic
molecules. Linear scaling of computational cost in system size is demonstrated
for saturated polymers with lengths up to the sub-mesoscale.
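The general recipe, kernel regression on per-atom features, can be sketched as follows; the descriptor below is a crude Coulomb-like stand-in of our own, not the paper's local atom-centered representation:

```python
# Kernel ridge regression on per-atom features: one descriptor per atom,
# one target per atom (e.g. a chemical shift). `local_descriptor` is a
# placeholder feature, not the paper's representation.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def local_descriptor(positions, charges, i, size=10):
    d = np.linalg.norm(positions - positions[i], axis=1)
    d[i] = np.inf                       # exclude the atom itself
    feat = np.sort(charges / d)[::-1]   # Coulomb-like, sorted for invariance
    out = np.zeros(size)
    out[:min(size, len(feat))] = feat[:size]
    return out

rng = np.random.default_rng(0)
positions = rng.uniform(size=(5, 3))           # toy 5-atom "molecule"
charges = np.array([6.0, 1.0, 1.0, 1.0, 1.0])  # CH4-like nuclear charges
X = np.array([local_descriptor(positions, charges, i) for i in range(5)])
y = rng.normal(size=5)                         # stand-in atomic property
model = KernelRidge(kernel="laplacian", alpha=1e-8, gamma=1e-3).fit(X, y)
print(model.predict(X))
```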
Detection of trend changes in time series using Bayesian inference
Change points in time series are perceived as isolated singularities where
two regular trends of a given signal do not match. The detection of such
transitions is of fundamental interest for the understanding of the system's
internal dynamics. In practice, observational noise makes it difficult to detect such change points in time series. In this work we develop a Bayesian method to estimate the locations of the singularities and to produce confidence
intervals. We validate the ability and sensitivity of our inference method by
estimating change points of synthetic data sets. As an application we use our
algorithm to analyze the annual flow volume of the Nile River at Aswan from
1871 to 1970, where we confirm a well-established significant transition point
within the time series.
Comment: 9 pages, 12 figures, submitted
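A minimal sketch of this kind of inference under simplifying assumptions (a single changepoint, piecewise-linear trends, Gaussian noise of known scale, uniform prior over locations); credible intervals follow from the posterior's quantiles. This is our simplification, not the paper's model:

```python
# Posterior over a single trend-change location: for each candidate k, fit
# two independent least-squares lines and score the Gaussian log-likelihood.
import numpy as np

def changepoint_posterior(t, y, sigma=1.0):
    n = len(y)
    log_post = np.full(n, -np.inf)
    for k in range(3, n - 3):              # need a few points on each side
        rss = 0.0
        for seg in (slice(0, k), slice(k, n)):
            coef = np.polyfit(t[seg], y[seg], 1)
            rss += np.sum((y[seg] - np.polyval(coef, t[seg])) ** 2)
        log_post[k] = -rss / (2 * sigma**2)
    p = np.exp(log_post - log_post.max())  # normalize safely
    return p / p.sum()

# Synthetic series with a trend change at t = 60:
t = np.arange(100.0)
noise = np.random.default_rng(0).normal(0, 0.3, 100)
y = np.where(t < 60, 0.05 * t, 3.0 - 0.04 * (t - 60)) + noise
print(np.argmax(changepoint_posterior(t, y, sigma=0.3)))  # near 60
```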
A General Optimization Technique for High Quality Community Detection in Complex Networks
Recent years have witnessed the development of a large body of algorithms for
community detection in complex networks. Most of them are based upon the
optimization of objective functions, among which modularity is the most common,
though a number of alternatives have been suggested in the scientific
literature. We present here an effective general search strategy for the
optimization of various objective functions for community detection purposes.
When applied to modularity, on both real-world and synthetic networks, our
search strategy substantially outperforms the best existing algorithms in terms
of final scores of the objective function; for description length, its
performance is on par with the original Infomap algorithm. The execution time
of our algorithm is on par with non-greedy alternatives present in the literature,
and networks of up to 10,000 nodes can be analyzed in time spans ranging from
minutes to a few hours on average workstations, making our approach readily
applicable to tasks which require the quality of partitioning to be as high as
possible, and are not limited by strict time constraints. Finally, based on the
most effective of the available optimization techniques, we compare the
performance of modularity and code length as objective functions, in terms of
the quality of the partitions one can achieve by optimizing them. To this end,
we evaluate the ability of each objective function to reconstruct the underlying structure of a large set of synthetic and real-world networks.
Comment: Main text: 14 pages, 4 figures, 1 table; supplementary information: 19 pages, 8 figures, 5 tables
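The shared core of such strategies, objective-driven local moves, can be sketched in a few lines. The greedy single-node-move loop below (modularity via networkx) is our simplification, not the paper's search strategy:

```python
# Greedily move single nodes between communities while modularity improves.
import networkx as nx
from networkx.algorithms.community import modularity

def greedy_node_moves(G, parts):
    """parts: list of sets of nodes (an initial partition), mutated in place."""
    best = modularity(G, parts)
    improved = True
    while improved:
        improved = False
        for v in G.nodes:
            src = next(p for p in parts if v in p)
            for dst in parts:
                if dst is src:
                    continue
                src.discard(v); dst.add(v)              # tentative move
                q = modularity(G, [p for p in parts if p])
                if q > best + 1e-12:
                    best, improved, src = q, True, dst  # accept the move
                else:
                    dst.discard(v); src.add(v)          # revert
    return [p for p in parts if p], best

G = nx.karate_club_graph()
parts, q = greedy_node_moves(G, [{v} for v in G.nodes])  # from singletons
print(len(parts), round(q, 3))
```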
Expected exponential loss for gaze-based video and volume ground truth annotation
Many recent machine learning approaches used in medical imaging are highly
reliant on large amounts of image and ground truth data. In the context of
object segmentation, pixel-wise annotations are extremely expensive to collect,
especially in video and 3D volumes. To reduce this annotation burden, we
propose a novel framework that allows annotators to simply observe the object to segment while recording where they have looked with a $200 eye gaze tracker. Our
method then estimates pixel-wise probabilities for the presence of the object
throughout the sequence, from which we train a classifier in a semi-supervised
setting using a novel Expected Exponential loss function. We show that our
framework provides superior performance on a wide range of medical imaging
settings compared to existing strategies and that our method can be combined
with current crowd-sourcing paradigms as well.
Comment: 9 pages, 5 figures, MICCAI 2017 LABELS Workshop
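The generic form of such a loss is easy to state: with per-pixel probabilities p = P(y = +1) inferred from gaze, the expectation of exp(-y f(x)) over the uncertain label is p*exp(-f) + (1-p)*exp(f). A minimal numpy sketch of this generic form (our simplification, not the paper's classifier):

```python
# Expected exponential loss under label uncertainty, plus its gradient for
# simple gradient-based training of the per-pixel scores.
import numpy as np

def expected_exp_loss(scores, p):
    """scores: real-valued classifier outputs f(x); p: P(y = +1)."""
    return np.mean(p * np.exp(-scores) + (1 - p) * np.exp(scores))

def grad(scores, p):
    """Gradient of the mean loss with respect to the scores."""
    return (-p * np.exp(-scores) + (1 - p) * np.exp(scores)) / len(scores)

p = np.array([0.9, 0.6, 0.1])  # gaze-derived foreground probabilities
f = np.zeros(3)
for _ in range(200):           # a few gradient steps on the scores
    f -= 1.0 * grad(f, p)
print(np.round(f, 2))          # scores approach 0.5 * log(p / (1 - p))
```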
Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models
We express the mean and variance terms in a double exponential regression
model as additive functions of the predictors and use Bayesian variable
selection to determine which predictors enter the model, and whether they enter
linearly or flexibly. When the variance term is null, we obtain a generalized
additive model, which becomes a generalized linear model if the predictors
enter the mean linearly. The model is estimated using Markov chain Monte Carlo
simulation and the methodology is illustrated using real and simulated data
sets.
Comment: 8 graphs, 35 pages
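A minimal sketch of the indicator mechanism behind this kind of selection, with each predictor in one of three states (excluded, linear, flexible); we use a quadratic basis as a crude stand-in for a flexible term and a BIC-style Gaussian approximation in place of the paper's double exponential model and MCMC scheme:

```python
# Metropolis sampling over per-predictor inclusion states:
# 0 = excluded, 1 = linear, 2 = "flexible" (here: linear + quadratic).
import numpy as np

def design(X, states):
    cols = [np.ones(len(X))]
    for j, s in enumerate(states):
        if s >= 1:
            cols.append(X[:, j])       # linear term
        if s == 2:
            cols.append(X[:, j] ** 2)  # crude flexible term
    return np.column_stack(cols)

def log_marginal(X, y, states):
    """BIC-style approximate log marginal under a Gaussian working model."""
    A = design(X, states)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    n, k = A.shape
    return -0.5 * n * np.log(rss / n) - 0.5 * k * np.log(n)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 1.5 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=200)

states = [0, 0, 0, 0]
cur = log_marginal(X, y, states)
for _ in range(2000):                      # Metropolis over indicator states
    j = rng.integers(4)
    prop = states.copy()
    prop[j] = rng.integers(3)
    new = log_marginal(X, y, prop)
    if np.log(rng.uniform()) < new - cur:  # uniform prior over states
        states, cur = prop, new
print(states)                              # expect roughly [1, 2, 0, 0]
```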