Feature and Variable Selection in Classification
The amount of information in the form of features and variables available
to machine learning algorithms is ever increasing. This can lead to classifiers
that are prone to overfitting in high dimensions; moreover, high-dimensional
models do not lend themselves to interpretable results, and the CPU and memory
resources necessary to run on high-dimensional datasets severely limit the
applicability of these approaches. Variable and feature selection aim to remedy
this by finding a subset of features that best captures the information
provided. In this paper we present the general methodology and highlight some
specific approaches.

Comment: Part of master seminar in document analysis held by Marcus
Eichenberger-Liwick
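A minimal sketch of one common filter-style approach such a survey would cover: score each feature independently against the labels and keep the top k. This uses scikit-learn's standard selection API; the synthetic dataset and the choice k=10 are illustrative assumptions, not this paper's specific method.

```python
# Filter-style feature selection: rank features by estimated mutual
# information with the class label, keep the k highest-scoring ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic high-dimensional data: 500 samples, 1000 features,
# of which only 10 are actually informative.
X, y = make_classification(n_samples=500, n_features=1000,
                           n_informative=10, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                             # (500, 10)
print(np.sort(selector.get_support(indices=True)))  # chosen feature indices
```

Filters like this are cheap and model-agnostic; wrapper and embedded methods (the other families typically surveyed) instead involve the classifier itself in the selection loop.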
Fully Bayesian Logistic Regression with Hyper-Lasso Priors for High-dimensional Feature Selection
High-dimensional feature selection arises in many areas of modern science.
For example, in genomic research we want to find the genes that can be used to
separate tissues of different classes (e.g. cancer and normal) from tens of
thousands of genes that are active (expressed) in certain tissue cells. To this
end, we wish to fit regression and classification models with a large number of
features (also called variables or predictors). In the past decade, penalized
likelihood methods for fitting regression models based on hyper-LASSO
penalization have received increasing attention in the literature. However,
fully Bayesian methods that use Markov chain Monte Carlo (MCMC) remain
underdeveloped. In this paper we introduce an MCMC
(fully Bayesian) method for learning severely multi-modal posteriors of
logistic regression models based on hyper-LASSO priors (non-convex penalties).
Our MCMC algorithm uses Hamiltonian Monte Carlo in a restricted Gibbs sampling
framework; we call our method Bayesian logistic regression with hyper-LASSO
(BLRHL) priors. We have used simulation studies and real data analysis to
demonstrate the superior performance of hyper-LASSO priors, and to investigate
the issues of choosing the heaviness and scale of hyper-LASSO priors.

Comment: 33 pages. arXiv admin note: substantial text overlap with
arXiv:1308.469
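The abstract names the two basic ingredients: a logistic likelihood and a heavy-tailed (hyper-LASSO-type) prior, sampled with Hamiltonian Monte Carlo. The sketch below is not the paper's BLRHL algorithm (which runs HMC inside a restricted Gibbs framework); it only illustrates those ingredients, with a Student-t prior standing in for a hyper-LASSO prior and all step sizes and hyperparameters chosen arbitrarily.

```python
# A minimal sketch: plain HMC on a logistic-regression posterior with a
# heavy-tailed prior on the coefficients. Illustrative only.
import numpy as np

def log_posterior_and_grad(beta, X, y, df=1.0, scale=0.1):
    """Log posterior (up to a constant) and its gradient.

    Likelihood: logistic regression with labels y in {0, 1}.
    Prior: independent Student-t(df, scale) on each coefficient,
    a heavy-tailed stand-in for a hyper-LASSO prior.
    """
    z = X @ beta
    loglik = np.sum(y * z - np.logaddexp(0.0, z))   # stable Bernoulli-logit
    p = 1.0 / (1.0 + np.exp(-z))
    grad_lik = X.T @ (y - p)
    t2 = (beta / scale) ** 2
    logprior = -0.5 * (df + 1.0) * np.sum(np.log1p(t2 / df))
    grad_prior = -(df + 1.0) * beta / (df * scale**2 + beta**2)
    return loglik + logprior, grad_lik + grad_prior

def hmc_step(beta, X, y, eps=0.01, n_leapfrog=20, rng=np.random):
    """One HMC step with identity mass matrix."""
    logp, grad = log_posterior_and_grad(beta, X, y)
    momentum = rng.standard_normal(beta.shape)
    h0 = logp - 0.5 * momentum @ momentum
    b, m = beta.copy(), momentum.copy()
    m = m + 0.5 * eps * grad                 # leapfrog: half momentum step
    for _ in range(n_leapfrog):
        b = b + eps * m                      # full position step
        logp_new, grad = log_posterior_and_grad(b, X, y)
        m = m + eps * grad                   # full momentum step
    m = m - 0.5 * eps * grad                 # undo the extra half step
    h1 = logp_new - 0.5 * m @ m
    # Metropolis accept/reject leaves the posterior invariant.
    if np.log(rng.uniform()) < h1 - h0:
        return b
    return beta
```

The multi-modality the paper emphasizes comes precisely from such non-convex priors, which is why plain HMC alone is usually insufficient and a Gibbs-style decomposition is layered on top.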
Ranking relations using analogies in biological and information networks
Analogical reasoning depends fundamentally on the ability to learn and
generalize about relations between objects. We develop an approach to
relational learning which, given a set of pairs of objects
$\mathbf{S} = \{A^{(1)}:B^{(1)}, \dots, A^{(N)}:B^{(N)}\}$,
measures how well other pairs A:B fit in with the set $\mathbf{S}$. Our work
addresses the following question: is the relation between objects A and B
analogous to those relations found in $\mathbf{S}$? Such questions are
particularly relevant in information retrieval, where an investigator might
want to search for analogous pairs of objects that match the query set of
interest. There are many ways in which objects can be related, making the task
of measuring analogies very challenging. Our approach combines a similarity
measure on function spaces with Bayesian analysis to produce a ranking. It
requires data containing features of the objects of interest and a link matrix
specifying which relationships exist; no further attributes of such
relationships are necessary. We illustrate the potential of our method on text
analysis and information networks. An application on discovering functional
interactions between pairs of proteins is discussed in detail, where we show
that our approach can work in practice even when only a small set of protein
pairs is provided.

Comment: Published at http://dx.doi.org/10.1214/09-AOAS321 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
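The abstract does not spell out the similarity measure on function spaces, so the sketch below substitutes a deliberately simple proxy: kernel similarity between concatenated pair features. It is not the paper's Bayesian score and ignores the link matrix entirely; it only makes the query-set ranking setup concrete, and all names and the bandwidth are illustrative.

```python
# Rank candidate pairs by average Gaussian-kernel similarity to a
# query set of analogous pairs. A toy proxy for the paper's method.
import numpy as np

def pair_features(obj_features, pairs):
    """Represent each pair (a, b) by the concatenated object features."""
    return np.array([np.concatenate([obj_features[a], obj_features[b]])
                     for a, b in pairs])

def rank_pairs(obj_features, query_pairs, candidate_pairs, bandwidth=1.0):
    """Rank candidates by mean kernel similarity to the query set."""
    Q = pair_features(obj_features, query_pairs)      # (n_query, d)
    C = pair_features(obj_features, candidate_pairs)  # (n_cand, d)
    # Squared distances between every candidate and every query pair.
    d2 = ((C[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
    scores = np.exp(-d2 / (2.0 * bandwidth**2)).mean(axis=1)
    order = np.argsort(-scores)
    return [candidate_pairs[i] for i in order], scores[order]
```

In the protein application, obj_features would hold per-protein attributes and the query set the known functionally interacting pairs; candidates are then ranked by how well their relation resembles the queries.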
Incremental Sparse Bayesian Ordinal Regression
Ordinal Regression (OR) aims to model the ordering information between
different data categories, a crucial topic in multi-label learning. An
important class of approaches to OR models the problem as a linear combination
of basis functions that map features to a high-dimensional non-linear space.
However, most basis function-based algorithms are time-consuming. We
propose an incremental sparse Bayesian approach to OR tasks and introduce an
algorithm to sequentially learn the relevant basis functions in the ordinal
scenario. Our method, called Incremental Sparse Bayesian Ordinal Regression
(ISBOR), automatically optimizes the hyper-parameters via the type-II maximum
likelihood method. By exploiting fast marginal likelihood optimization, ISBOR
avoids the large matrix inversions that are the main bottleneck in applying
basis function-based algorithms to OR tasks on large-scale datasets. We show that
ISBOR can make accurate predictions with parsimonious basis functions while
offering automatic estimates of the prediction uncertainty. Extensive
experiments on synthetic and real-world datasets demonstrate the efficiency and
effectiveness of ISBOR compared to other basis function-based OR approaches.
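ISBOR's speed comes from fast marginal likelihood optimization in the style of sparse Bayesian learning. The sketch below shows the standard per-basis add/prune test for the simpler linear-Gaussian case (the "sparsity" and "quality" factors of Tipping and Faul's fast method); the ordinal-likelihood approximations ISBOR needs on top of this are not shown, and the function names are illustrative assumptions.

```python
# A minimal sketch of the fast marginal likelihood relevance test used
# in sparse Bayesian learning, which ISBOR adapts to the ordinal case.
import numpy as np

def relevance_decision(phi, t, C_inv):
    """Decide whether candidate basis column `phi` is worth including.

    phi   : (n,) candidate basis function evaluated on the data
    t     : (n,) targets
    C_inv : (n, n) inverse covariance of the current model,
            excluding `phi` (C = sigma^2 I + Phi A^-1 Phi^T)
    Returns (include, alpha): alpha is the optimal weight precision
    when the basis function should be included.
    """
    s = phi @ C_inv @ phi   # 'sparsity': overlap with what the current
                            # basis set already explains
    q = phi @ C_inv @ t     # 'quality': alignment with the targets
    if q**2 > s:            # inclusion raises the type-II marginal likelihood
        alpha = s**2 / (q**2 - s)
        return True, alpha
    return False, np.inf    # alpha -> infinity prunes the basis function
```

Because each decision needs only these two scalars per candidate, the basis set can be grown and pruned sequentially without ever inverting the full design matrix, which is the bottleneck the abstract refers to.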
- …