
    The Loss Rank Principle for Model Selection

    We introduce a new principle for model selection in regression and classification. Many regression models are controlled by some smoothness, flexibility, or complexity parameter c, e.g. the number of neighbors to be averaged over in k nearest neighbor (kNN) regression or the polynomial degree in regression with polynomials. Let f_D^c be the (best) regressor of complexity c on data D. A more flexible regressor can fit more data sets D' well than a more rigid one. If something (here, a small loss) is easy to achieve, it is typically worth less. We define the loss rank of f_D^c as the number of other (fictitious) data sets D' that are fitted better by f_D'^c than D is fitted by f_D^c. We suggest selecting the model complexity c that has minimal loss rank (LoRP). Unlike most penalized maximum likelihood variants (AIC, BIC, MDL), LoRP depends only on the regression function and the loss function. It works without a stochastic noise model and is directly applicable to any non-parametric regressor, such as kNN. In this paper we formalize, discuss, and motivate LoRP, study it for specific regression problems, in particular linear ones, and compare it to other model selection schemes. Comment: 16 pages
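
    A minimal sketch of the idea in Python. The paper derives closed-form loss ranks for linear regressors such as kNN; the Monte Carlo count over fictitious targets below is only our illustration of the definition, and the names knn_smoother and loss_rank_mc are ours.

        # Monte Carlo illustration of the loss-rank idea for kNN regression.
        # The paper works with a closed form for linear smoothers; sampling
        # fictitious targets here only mirrors the definition of the rank.
        import numpy as np

        def knn_smoother(x, k):
            """Return the kNN hat matrix S, so that predictions are S @ y."""
            n = len(x)
            S = np.zeros((n, n))
            for i in range(n):
                idx = np.argsort(np.abs(x - x[i]))[:k]  # k nearest neighbors of x_i
                S[i, idx] = 1.0 / k
            return S

        def loss_rank_mc(x, y, k, n_fictitious=5000, seed=0):
            """Fraction of fictitious targets y' that the kNN regressor of
            complexity k fits better (lower squared loss) than the observed y."""
            rng = np.random.default_rng(seed)
            S = knn_smoother(x, k)
            loss_obs = np.mean((y - S @ y) ** 2)
            y_fic = rng.normal(y.mean(), y.std(), size=(n_fictitious, len(y)))
            loss_fic = np.mean((y_fic - y_fic @ S.T) ** 2, axis=1)
            return np.mean(loss_fic < loss_obs)

        # LoRP: pick the complexity (here k) with the smallest loss rank.
        x = np.linspace(0, 1, 50)
        y = np.sin(4 * x) + 0.1 * np.random.default_rng(1).normal(size=50)
        best_k = min(range(1, 20), key=lambda k: loss_rank_mc(x, y, k))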

    Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science

    As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a series of simulated and real-world benchmark data sets. In particular, we show that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input or prior knowledge from the user. We also address the tendency for TPOT to design overly complex pipelines by integrating Pareto optimization, which produces compact pipelines without sacrificing classification accuracy. As such, this work represents an important step toward fully automating machine learning pipeline design. Comment: 8 pages, 5 figures, preprint to appear in GECCO 2016, edits from reviewer comments not yet made
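
    TPOT is released as an open-source Python package; a minimal usage sketch along the lines described above is shown below. The dataset and the generations/population_size values are illustrative choices, not the paper's benchmark settings.

        # Minimal TPOT usage sketch: evolve a pipeline, score it, export it.
        from tpot import TPOTClassifier
        from sklearn.datasets import load_digits
        from sklearn.model_selection import train_test_split

        X, y = load_digits(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

        # Genetic programming searches over tree-shaped pipelines; Pareto
        # optimization trades accuracy against pipeline complexity.
        tpot = TPOTClassifier(generations=5, population_size=20,
                              verbosity=2, random_state=42)
        tpot.fit(X_train, y_train)
        print(tpot.score(X_test, y_test))
        tpot.export('tpot_pipeline.py')  # write the best pipeline as scikit-learn code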

    Tracing the Evolution of Physics on the Backbone of Citation Networks

    Many innovations are inspired by past ideas in a non-trivial way. Tracing these origins and identifying scientific branches is crucial for inspiring new research. In this paper, we use citation relations to identify the descendant chart, i.e. the family tree of research papers. Unlike other spanning trees, which focus on cost or distance minimization, we make use of the nature of citations and identify the most important parent for each publication, leading to a tree-like backbone of the citation network. Measures are introduced to validate the backbone as the descendant chart. We show that citation backbones can effectively characterize the hierarchical and fractal structure of scientific development, and lead to accurate classification of fields and sub-fields. Comment: 6 pages, 5 figures
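
    A minimal sketch of the backbone construction using networkx. Picking the most-cited reference as the single parent is our simplification; the paper defines its own measure of the "most important parent".

        # Keep one parent per paper: the reference with the highest in-degree.
        import networkx as nx

        def citation_backbone(citation_edges):
            """citation_edges: (paper, reference) pairs, i.e. paper cites reference."""
            G = nx.DiGraph(citation_edges)
            backbone = nx.DiGraph()
            backbone.add_nodes_from(G)
            for paper in G:
                refs = list(G.successors(paper))
                if refs:  # most-cited reference becomes the single parent
                    parent = max(refs, key=G.in_degree)
                    backbone.add_edge(paper, parent)
            return backbone  # a forest: each paper has at most one parent

        edges = [("C", "A"), ("C", "B"), ("D", "C"), ("D", "A"), ("E", "C")]
        print(list(citation_backbone(edges).edges()))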

    Machine Learning for Quantum Mechanical Properties of Atoms in Molecules

    We introduce machine learning models of quantum mechanical observables of atoms in molecules. Instant out-of-sample predictions for proton and carbon nuclear chemical shifts, atomic core level excitations, and forces on atoms reach accuracies on par with the density functional theory reference. Locality is exploited within non-linear regression via local atom-centered coordinate systems. The approach is validated on a diverse set of 9k small organic molecules. Linear scaling of computational cost in system size is demonstrated for saturated polymers with up to sub-mesoscale lengths.
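
    A generic sketch of the kind of per-atom kernel regression described above, using scikit-learn's kernel ridge regression. The random descriptors and the hyperparameter values are placeholders, not the paper's atom-centered representation or settings.

        # Map a local descriptor vector per atom to a scalar property
        # (e.g. a chemical shift) with a Gaussian-kernel model.
        import numpy as np
        from sklearn.kernel_ridge import KernelRidge

        rng = np.random.default_rng(0)
        X_atoms = rng.normal(size=(500, 30))            # stand-in local atomic descriptors
        y_shift = X_atoms[:, :5].sum(axis=1) + 0.1 * rng.normal(size=500)

        model = KernelRidge(kernel="rbf", gamma=0.05, alpha=1e-3)
        model.fit(X_atoms[:400], y_shift[:400])
        print(model.score(X_atoms[400:], y_shift[400:]))  # out-of-sample R^2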

    Detection of trend changes in time series using Bayesian inference

    Change points in time series are perceived as isolated singularities where two regular trends of a given signal do not match. The detection of such transitions is of fundamental interest for understanding a system's internal dynamics. In practice, observational noise makes it difficult to detect such change points in time series. In this work we develop a Bayesian method to estimate the location of the singularities and to produce confidence intervals. We validate the ability and sensitivity of our inference method by estimating change points of synthetic data sets. As an application, we use our algorithm to analyze the annual flow volume of the Nile River at Aswan from 1871 to 1970, where we confirm a well-established significant transition point within the time series. Comment: 9 pages, 12 figures, submitted
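
    A minimal sketch of single-change-point inference under Gaussian noise with a piecewise-constant mean, only to give the flavor of the posterior the abstract refers to; the paper's trend-change model and confidence intervals are richer than this toy.

        # Posterior over the change-point index tau under a flat prior,
        # plugging in each segment's empirical mean.
        import numpy as np

        def change_point_posterior(y, sigma):
            n = len(y)
            log_post = np.full(n, -np.inf)
            for tau in range(2, n - 1):
                left, right = y[:tau], y[tau:]
                rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
                log_post[tau] = -rss / (2 * sigma ** 2)
            log_post -= log_post.max()
            post = np.exp(log_post)
            return post / post.sum()

        rng = np.random.default_rng(0)
        y = np.concatenate([rng.normal(10, 1, 60), rng.normal(8, 1, 40)])  # shift at t = 60
        post = change_point_posterior(y, sigma=1.0)
        print(post.argmax())  # MAP change-point location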

    A General Optimization Technique for High Quality Community Detection in Complex Networks

    Recent years have witnessed the development of a large body of algorithms for community detection in complex networks. Most of them are based upon the optimization of objective functions, among which modularity is the most common, though a number of alternatives have been suggested in the scientific literature. We present here an effective general search strategy for the optimization of various objective functions for community detection purposes. When applied to modularity, on both real-world and synthetic networks, our search strategy substantially outperforms the best existing algorithms in terms of final scores of the objective function; for description length, its performance is on par with the original Infomap algorithm. The execution time of our algorithm is on par with non-greedy alternatives in the literature, and networks of up to 10,000 nodes can be analyzed in time spans ranging from minutes to a few hours on average workstations, making our approach readily applicable to tasks which require the quality of partitioning to be as high as possible and are not limited by strict time constraints. Finally, based on the most effective of the available optimization techniques, we compare the performance of modularity and code length as objective functions, in terms of the quality of the partitions one can achieve by optimizing them. To this end, we evaluated the ability of each objective function to reconstruct the underlying structure of a large set of synthetic and real-world networks. Comment: Main text: 14 pages, 4 figures, 1 table; supplementary information: 19 pages, 8 figures, 5 tables
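
    The search strategy itself is the paper's contribution and is not reproduced here; for orientation, the sketch below merely scores a standard greedy modularity baseline available in networkx.

        # Baseline: greedy modularity maximization and the resulting Q score.
        import networkx as nx
        from networkx.algorithms import community

        G = nx.karate_club_graph()
        parts = community.greedy_modularity_communities(G)
        print(len(parts), community.modularity(G, parts))  # number of communities, Q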

    Expected exponential loss for gaze-based video and volume ground truth annotation

    Many recent machine learning approaches used in medical imaging are highly reliant on large amounts of image and ground truth data. In the context of object segmentation, pixel-wise annotations are extremely expensive to collect, especially in video and 3D volumes. To reduce this annotation burden, we propose a novel framework that lets annotators simply observe the object to be segmented while a $200 eye gaze tracker records where they looked. Our method then estimates pixel-wise probabilities for the presence of the object throughout the sequence, from which we train a classifier in a semi-supervised setting using a novel expected exponential loss function. We show that our framework outperforms existing strategies on a wide range of medical imaging settings and that it can be combined with current crowd-sourcing paradigms. Comment: 9 pages, 5 figures, MICCAI 2017 LABELS Workshop
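
    A rough illustration of an expected exponential loss over soft labels (the paper's exact formulation may differ): with p the estimated probability that a pixel belongs to the object and f the classifier score, the exponential loss is averaged over both possible labels.

        # E[exp(-y f)] with y in {+1, -1} and P(y = +1) = p.
        import numpy as np

        def expected_exp_loss(f, p):
            """f: classifier scores, p: probability that the true label is +1."""
            return np.mean(p * np.exp(-f) + (1.0 - p) * np.exp(f))

        scores = np.array([2.0, -1.0, 0.3])
        probs = np.array([0.9, 0.2, 0.6])  # e.g. derived from gaze fixation density
        print(expected_exp_loss(scores, probs))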

    Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models

    We express the mean and variance terms in a double exponential regression model as additive functions of the predictors and use Bayesian variable selection to determine which predictors enter the model, and whether they enter linearly or flexibly. When the variance term is null, we obtain a generalized additive model, which becomes a generalized linear model if the predictors enter the mean linearly. The model is estimated using Markov chain Monte Carlo simulation, and the methodology is illustrated using real and simulated data sets. Comment: 35 pages, 8 graphs
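
    A toy illustration of the variable-selection step only: predictor subsets of a plain linear mean model are ranked by BIC as a crude stand-in for posterior model probabilities. The paper's double exponential model, additive terms, and MCMC sampler are not reproduced here.

        # Enumerate predictor subsets and keep the one with the lowest BIC.
        import itertools
        import numpy as np

        rng = np.random.default_rng(0)
        n, p = 200, 4
        X = rng.normal(size=(n, p))
        y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)  # only x1 and x3 matter

        def bic(subset):
            Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = ((y - Z @ beta) ** 2).sum()
            return n * np.log(rss / n) + Z.shape[1] * np.log(n)

        subsets = [s for r in range(p + 1) for s in itertools.combinations(range(p), r)]
        print(min(subsets, key=bic))  # expected to recover (0, 2)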