27,576 research outputs found
Bayesian Approximate Kernel Regression with Variable Selection
Nonlinear kernel regression models are often used in statistics and machine
learning because they are more accurate than linear models. Variable selection
for kernel regression models is a challenge partly because, unlike the linear
regression setting, there is no clear concept of an effect size for regression
coefficients. In this paper, we propose a novel framework that provides an
effect size analog of each explanatory variable for Bayesian kernel regression
models when the kernel is shift-invariant --- for example, the Gaussian kernel.
We use function analytic properties of shift-invariant reproducing kernel
Hilbert spaces (RKHS) to define a linear vector space that: (i) captures
nonlinear structure, and (ii) can be projected onto the original explanatory
variables. The projection onto the original explanatory variables serves as an
analog of effect sizes. The specific function analytic property we use is that
shift-invariant kernel functions can be approximated via random Fourier bases.
Based on the random Fourier expansion we propose a computationally efficient
class of Bayesian approximate kernel regression (BAKR) models for both
nonlinear regression and binary classification for which one can compute an
analog of effect sizes. We illustrate the utility of BAKR by examining two
important problems in statistical genetics: genomic selection (i.e. phenotypic
prediction) and association mapping (i.e. inference of significant variants or
loci). State-of-the-art methods for genomic selection and association mapping
are based on kernel regression and linear models, respectively. BAKR is the
first method that is competitive in both settings.Comment: 22 pages, 3 figures, 3 tables; theory added; new simulations
presented; references adde
Advocating better habitat use and selection models in bird ecology
Studies on habitat use and habitat selection represent a basic aspect of bird ecology, due to its importance in natural history, distribution, response to environmental changes, management and conservation. Basically, a statistical model that identifies environmental variables linked to a species presence is searched for. In this sense, there is a wide array of analytical methods that identify important explanatory variables within a model, with higher explanatory and predictive power than classical regression approaches. However, some of these powerful models are not widespread in ornithological studies, partly because of their complex theory, and in some cases, difficulties on their implementation and interpretation. Here, I describe generalized linear models and other five statistical models for the analysis of bird habitat use and selection outperforming classical approaches: generalized additive models, mixed effects models, occupancy models, binomial N-mixture models and decision trees (classification and regression trees, bagging, random forests and boosting). Each of these models has its benefits and drawbacks, but major advantages include dealing with non-normal distributions (presence-absence and abundance data typically found in habitat use and selection studies), heterogeneous variances, non-linear and complex relationships among variables, lack of statistical independence and imperfect detection. To aid ornithologists in making use of the methods described, a readable description of each method is provided, as well as a flowchart along with some recommendations to help them decide the most appropriate analysis. The use of these models in ornithological studies is encouraged, given their huge potential as statistical tools in bird ecology.Fil: Palacio, Facundo Xavier. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de La Plata. Facultad de Ciencias Naturales y Museo. División Zoología de Vertebrados. Sección Ornitología; Argentin
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, with imbalance in classes of
interests, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expected maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, and more accurate and robust classification results.Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
European exchange trading funds trading with locally weighted support vector regression
In this paper, two different Locally Weighted Support Vector Regression (wSVR) algorithms are generated and applied to the task of forecasting and trading five European Exchange Traded Funds. The trading application covers the recent European Monetary Union debt crisis. The performance of the proposed models is benchmarked against traditional Support Vector Regression (SVR) models. The Radial Basis Function, the Wavelet and the Mahalanobis kernel are explored and tested as SVR kernels. Finally, a novel statistical SVR input selection procedure is introduced based on a principal component analysis and the Hansen, Lunde, and Nason (2011) model confidence test. The results demonstrate the superiority of the wSVR models over the traditional SVRs and of the v-SVR over the ε-SVR algorithms. We note that the performance of all models varies and considerably deteriorates in the peak of the debt crisis. In terms of the kernels, our results do not confirm the belief that the Radial Basis Function is the optimum choice for financial series
- …