Cluster membership probabilities from proper motions and multiwavelength photometric catalogues: I. Method and application to the Pleiades cluster
We present a new technique designed to take full advantage of the high
dimensionality (photometric, astrometric, temporal) of the DANCe survey to
derive self-consistent and robust membership probabilities of the Pleiades
cluster. We aim to develop a methodology to infer membership probabilities
for the Pleiades cluster from the DANCe multidimensional astro-photometric data
set in a consistent way throughout the entire derivation. The determination of
the membership probabilities has to be applicable to censored data and must
incorporate the measurement uncertainties into the inference procedure.
We use Bayes' theorem and a curvilinear forward model for the likelihood of
the measurements of cluster members in the colour-magnitude space, to infer
posterior membership probabilities. The distributions of the cluster members'
proper motions and of the contaminants in the full multidimensional
astro-photometric space are modelled with mixture-of-Gaussians likelihoods.
We analyse several representation spaces
composed of the proper motions plus a subset of the available magnitudes and
colour indices. We select two prominent representation spaces composed of
variables selected using feature relevance determination techniques based on
Random Forests, and analyse the resulting samples of high probability
candidates. We consistently find lists of high probability (p > 0.9975)
candidates with 1000 sources, 4 to 5 times more than obtained in the
most recent astro-photometric studies of the cluster.
The methodology presented here is ready for application in data sets that
include more dimensions, such as radial and/or rotational velocities, spectral
indices, and variability.

Comment: 14 pages, 4 figures, accepted by A&
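As a sketch of the Bayes-theorem step described above, the snippet below turns cluster and field likelihoods in proper-motion space into a posterior membership probability. This is a minimal illustration, not the paper's fitted model: each population is a single Gaussian rather than a mixture, and every parameter (centroids, covariances, the prior member fraction) is a hypothetical value chosen for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters only (not the paper's fitted values).
mu_cluster = np.array([19.7, -44.8])     # hypothetical (pmra, pmdec) in mas/yr
cov_cluster = np.diag([1.5, 1.5]) ** 2   # tight cluster proper-motion spread
mu_field = np.array([0.0, 0.0])          # broad field (contaminant) population
cov_field = np.diag([25.0, 25.0]) ** 2
prior_cluster = 0.01                     # assumed prior fraction of members

def membership_probability(x):
    """Posterior P(member | x) via Bayes' theorem with two Gaussians."""
    like_c = multivariate_normal.pdf(x, mu_cluster, cov_cluster)
    like_f = multivariate_normal.pdf(x, mu_field, cov_field)
    num = prior_cluster * like_c
    return num / (num + (1.0 - prior_cluster) * like_f)

p_near = membership_probability([19.5, -45.0])  # near the cluster centroid
p_far = membership_probability([5.0, 3.0])      # field-like proper motion
```

A real application would replace the single Gaussians with fitted mixtures and fold measurement uncertainties into the likelihoods, as the abstract requires.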
Robust Feature Selection by Mutual Information Distributions
Mutual information is widely used in artificial intelligence, in a
descriptive way, to measure the stochastic dependence of discrete random
variables. In order to address questions such as the reliability of the
empirical value, one must consider sample-to-population inferential approaches.
This paper deals with the distribution of mutual information, as obtained in a
Bayesian framework by a second-order Dirichlet prior distribution. The exact
analytical expression for the mean and an analytical approximation of the
variance are reported. Asymptotic approximations of the distribution are
proposed. The results are applied to the problem of selecting features for
incremental learning and classification of the naive Bayes classifier. A fast,
newly defined method is shown to outperform the traditional approach based on
empirical mutual information on a number of real data sets. Finally, a
theoretical development is reported that allows the above methods to be
extended efficiently to incomplete samples.

Comment: 8 two-column page
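The posterior mean of mutual information under a symmetric Dirichlet prior has a closed form in terms of the digamma function, which the sketch below implements. The `prior` pseudocount and the example contingency tables are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.special import digamma

def expected_mutual_information(counts, prior=1.0):
    """Posterior mean of mutual information under a Dirichlet prior.

    `counts` is the observed contingency table of two discrete variables;
    `prior` is a symmetric Dirichlet pseudocount added to every cell
    (an assumed choice of prior for this sketch).
    """
    n = np.asarray(counts, dtype=float) + prior
    N = n.sum()
    ni = n.sum(axis=1, keepdims=True)   # row totals
    nj = n.sum(axis=0, keepdims=True)   # column totals
    terms = digamma(n + 1) - digamma(ni + 1) - digamma(nj + 1) + digamma(N + 1)
    return float((n * terms).sum() / N)

# A strongly dependent table should score well above an independent one
# of the same size, which is what a feature-selection filter exploits.
dependent = [[40, 2], [3, 45]]
independent = [[22, 23], [23, 22]]
mi_dep = expected_mutual_information(dependent)
mi_ind = expected_mutual_information(independent)
```

For large counts the digamma terms reduce to logarithms and the expression converges to the empirical mutual information, while at small counts it corrects the well-known positive bias of the plug-in estimate.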
Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains
There has been increased interest in devising learning techniques that
combine unlabeled data with labeled data, i.e. semi-supervised learning.
However, to the best of our knowledge, no study has been performed across
various techniques and different types and amounts of labeled and unlabeled
data. Moreover, most of the published work on semi-supervised learning
techniques assumes that the labeled and unlabeled data come from the same
distribution. It is possible for the labeling process to be associated with a
selection bias such that the distributions of data points in the labeled and
unlabeled sets are different. Not correcting for such bias can result in biased
function approximation with potentially poor performance. In this paper, we
present an empirical study of various semi-supervised learning techniques on a
variety of datasets. We attempt to answer questions such as the effect of
independence or relevance amongst features, the effect of the sizes of the
labeled and unlabeled sets, and the effect of noise. We also investigate the
impact of sample-selection bias on the semi-supervised learning techniques
under study and implement a bivariate probit technique specifically designed
to correct for such bias.
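As a toy illustration of combining labeled and unlabeled data, the following self-training loop pseudo-labels an unlabeled pool with a nearest-centroid classifier and refits on the union. The synthetic data, the choice of classifier, and the iteration count are all hypothetical, and the bivariate-probit bias correction studied in the paper is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian classes; only a handful of points are labeled.
n_lab, n_unlab = 5, 200
X_lab = np.vstack([rng.normal(-2, 1, (n_lab, 2)), rng.normal(2, 1, (n_lab, 2))])
y_lab = np.array([0] * n_lab + [1] * n_lab)
X_unlab = np.vstack([rng.normal(-2, 1, (n_unlab, 2)),
                     rng.normal(2, 1, (n_unlab, 2))])

def predict(X, centroids):
    """Assign each row of X to its nearest class centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Self-training: fit centroids on the labeled set, pseudo-label the
# unlabeled pool, refit on the union, and repeat.
centroids = np.array([X_lab[y_lab == c].mean(axis=0) for c in (0, 1)])
for _ in range(5):
    pseudo = predict(X_unlab, centroids)
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, pseudo])
    centroids = np.array([X_all[y_all == c].mean(axis=0) for c in (0, 1)])
```

When the labeled and unlabeled sets come from the same distribution, as here, the pseudo-labels sharpen the centroid estimates; under the selection bias discussed above, the same loop can instead amplify the bias.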
Invariant Models for Causal Transfer Learning
Methods of transfer learning try to combine knowledge from several related
tasks (or domains) to improve performance on a test task. Inspired by causal
methodology, we relax the usual covariate shift assumption and assume that it
holds true for a subset of predictor variables: the conditional distribution of
the target variable given this subset of predictors is invariant over all
tasks. We show how this assumption can be motivated from ideas in the field of
causality. We focus on the problem of Domain Generalization, in which no
examples from the test task are observed. We prove that in an adversarial
setting using this subset for prediction is optimal in Domain Generalization;
we further provide examples in which the tasks are sufficiently diverse and
the estimator therefore outperforms pooling the data, even on average. If
examples from the test task are available, we also provide a method to transfer
knowledge from the training tasks and exploit all available features for
prediction. However, we provide no guarantees for this method. We introduce a
practical method which allows for automatic inference of the above subset and
provide corresponding code. We present results on synthetic data sets and a
gene deletion data set.
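A minimal sketch of searching for an invariant predictor subset: on synthetic multi-task data, each candidate subset is fit on the pooled data, and subsets whose per-task residual means drift are rejected. The data-generating process, the residual-mean invariance statistic, and all constants below are assumptions made for illustration; they are not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three hypothetical training tasks. The conditional Y | X0 is invariant
# (Y = 2*X0 + noise in every task), while the distribution of X0 and the
# anti-causal link from Y to X1 both shift across tasks.
def make_task(mean_x0, offset_x1, n=1000):
    x0 = rng.normal(mean_x0, 1.0, n)
    y = 2.0 * x0 + rng.normal(0.0, 0.5, n)
    x1 = y + offset_x1 + rng.normal(0.0, 0.5, n)
    return np.column_stack([x0, x1]), y

tasks = [make_task(m, b) for m, b in [(-1.0, -2.0), (0.0, 0.0), (1.0, 2.0)]]
X_pool = np.vstack([X for X, _ in tasks])
y_pool = np.concatenate([y for _, y in tasks])
sizes = [len(y) for _, y in tasks]

def invariance_score(subset):
    """Fit on pooled data, then check per-task residual means.

    For an invariant subset the residuals are centred at zero in every
    task; a large per-task mean residual signals a shifting conditional.
    """
    A = np.column_stack([X_pool[:, list(subset)], np.ones(len(y_pool))])
    coef, *_ = np.linalg.lstsq(A, y_pool, rcond=None)
    res = y_pool - A @ coef
    bounds = np.cumsum([0] + sizes)
    return max(abs(res[a:b].mean()) for a, b in zip(bounds[:-1], bounds[1:]))

subsets = [(), (0,), (1,), (0, 1)]
best = min(subsets, key=invariance_score)   # expected: the causal subset (0,)
```

Only the subset {X0} keeps the residuals centred in every task, mirroring the adversarial-optimality claim above: predictors whose link to the target shifts across tasks are excluded, even when they are highly predictive within any single task.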