4,007 research outputs found
Experiment Databases: Creating a New Platform for Meta-Learning Research
Many studies in machine learning try to investigate what makes an algorithm succeed or fail on certain datasets. However, the field is still evolving relatively quickly, and new algorithms, preprocessing methods, learning tasks and evaluation procedures continue to emerge in the literature. Thus, it is impossible for a single study to cover this expanding space of learning approaches. In this paper, we propose a community-based approach for the analysis of learning algorithms, driven by sharing meta-data from previous experiments in a uniform way. We illustrate how organizing this information in a central database can create a practical public platform for any kind of exploitation of meta-knowledge, allowing effective reuse of previous experimentation and targeted analysis of the collected results
Bounding Optimality Gap in Stochastic Optimization via Bagging: Statistical Efficiency and Stability
We study a statistical method to estimate the optimal value, and the
optimality gap of a given solution for stochastic optimization as an assessment
of the solution quality. Our approach is based on bootstrap aggregating, or
bagging, resampled sample average approximation (SAA). We show how this
approach leads to valid statistical confidence bounds for non-smooth
optimization. We also demonstrate its statistical efficiency and stability that
are especially desirable in limited-data situations, and compare these
properties with some existing methods. We present our theory that views SAA as
a kernel in an infinite-order symmetric statistic, which can be approximated
via bagging. We substantiate our theoretical findings with numerical results
Yield Curve Predictability, Regimes, and Macroeconomic Information: A Data-Driven Approach
We propose an empirical approach to determine the various economic sources driving the US yield curve. We allow the conditional dynamics of the yield at different maturities to change in reaction to past information coming from several relevant predictor variables. We consider both endogenous, yield curve factors and exogenous, macroeconomic factors as predictors in our model, letting the data themselves choose the most important variables. We find clear, different economic patterns in the local dynamics and regime specification of the yields depending on the maturity. Moreover, we present strong empirical evidence for the accuracy of the model in fitting in-sample and predicting out-of-sample the yield curve in comparison to several alternative approaches.Yield curve modeling and forecasting; Macroeconomic variables; Tree-structured models; Threshold regimes; GARCH; Bagging
A scale-space approach with wavelets to singularity estimation
This paper is concerned with the problem of determining the typical features of a curve when it is observed with noise. It has been shown that one can characterize the Lipschitz singularities of a signal by following the propagation across scales of the modulus maxima of its continuous wavelet transform. A nonparametric approach, based on appropriate thresholding of the empirical wavelet coefficients, is proposed to estimate the wavelet maxima of a signal observed with noise at various scales. In order to identify the singularities of the unknown signal, we introduce a new tool, "the structural intensity", that computes the "density" of the location of the modulus maxima of a wavelet representation along various scales. This approach is shown to be an effective technique for detecting the significant singularities of a signal corrupted by noise and for removing spurious estimates. The asymptotic properties of the resulting estimators are studied and illustrated by simulations. An application to a real data set is also proposed
Consistency of random forests
Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45
(2001) 5--32] that combines several randomized decision trees and aggregates
their predictions by averaging. Despite its wide usage and outstanding
practical performance, little is known about the mathematical properties of the
procedure. This disparity between theory and practice originates in the
difficulty to simultaneously analyze both the randomization process and the
highly data-dependent tree structure. In the present paper, we take a step
forward in forest exploration by proving a consistency result for Breiman's
[Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive
regression models. Our analysis also sheds an interesting light on how random
forests can nicely adapt to sparsity. 1. Introduction. Random forests are an
ensemble learning method for classification and regression that constructs a
number of randomized decision trees during the training phase and predicts by
averaging the results. Since its publication in the seminal paper of Breiman
(2001), the procedure has become a major data analysis tool, that performs well
in practice in comparison with many standard methods. What has greatly
contributed to the popularity of forests is the fact that they can be applied
to a wide range of prediction problems and have few parameters to tune. Aside
from being simple to use, the method is generally recognized for its accuracy
and its ability to deal with small sample sizes, high-dimensional feature
spaces and complex data structures. The random forest methodology has been
successfully involved in many practical problems, including air quality
prediction (winning code of the EMC data science global hackathon in 2012, see
http://www.kaggle.com/c/dsg-hackathon), chemoinformatics [Svetnik et al.
(2003)], ecology [Prasad, Iverson and Liaw (2006), Cutler et al. (2007)], 3
Classification
In Classification learning, an algorithm is presented with a set of classified examples or ‘‘instances’’ from which it is expected to infer a way of classifying unseen instances into one of several ‘‘classes’’. Instances have a set of features or ‘‘attributes’’ whose values define that particular instance. Numeric prediction, or ‘‘regression,’’ is a variant of classification learning in which the class attribute is numeric rather than categorical. Classification learning is sometimes called supervised because the method operates under supervision by being provided with the actual outcome for each of the training instances. This contrasts with Data clustering (see entry Data Clustering), where the classes are not given, and with Association learning (see entry Association Learning), which seeks any association – not just one that predicts the class
- …