Robust Feature Selection by Mutual Information Distributions
Mutual information is widely used in artificial intelligence, in a
descriptive way, to measure the stochastic dependence of discrete random
variables. In order to address questions such as the reliability of the
empirical value, one must consider sample-to-population inferential approaches.
This paper deals with the distribution of mutual information, as obtained in a
Bayesian framework by a second-order Dirichlet prior distribution. The exact
analytical expression for the mean and an analytical approximation of the
variance are reported. Asymptotic approximations of the distribution are
proposed. The results are applied to the problem of selecting features for
incremental learning and classification of the naive Bayes classifier. A fast,
newly defined method is shown to outperform the traditional approach based on
empirical mutual information on a number of real data sets. Finally, a
theoretical development is reported that allows one to efficiently extend the
above methods to incomplete samples in an easy and effective way.
Comment: 8 two-column pages
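As a point of reference for the baseline this abstract mentions, the following is a minimal sketch of feature ranking by empirical (plug-in) mutual information computed from joint count tables. It does not implement the Bayesian mean or variance described in the paper; the function names and the example tables are hypothetical and chosen only for illustration.

```python
import math


def empirical_mutual_information(counts):
    """Plug-in mutual information (in nats) from a 2-D table of joint counts."""
    n = sum(sum(row) for row in counts)
    if n == 0:
        raise ValueError("count table is empty")
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # cells with zero count contribute nothing
                mi += (n_ij / n) * math.log(n_ij * n / (row_totals[i] * col_totals[j]))
    return mi


def rank_features(tables):
    """Sort feature names by decreasing empirical mutual information with the class."""
    scores = {name: empirical_mutual_information(t) for name, t in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical feature-by-class count tables (rows: feature values, columns: classes).
tables = {
    "informative": [[8, 2], [2, 8]],
    "uninformative": [[5, 5], [5, 5]],
}
ranking = rank_features(tables)
```

The plug-in estimate is a single number per feature and carries no indication of how reliable it is for a small sample, which is the gap a distribution over mutual information is meant to address.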
Improved Training for Self-Training by Confidence Assessments
It is well known that for some tasks, labeled data sets may be hard to
gather. We therefore set out to tackle the problem of insufficient
training data. We examined methods for learning from unlabeled data after an
initial training on a limited labeled data set. The suggested approach can be
used as an online learning method on the unlabeled test set. In the general
classification task, whenever we predict a label with high enough confidence,
we treat it as a true label and train on the data accordingly. For the semantic
segmentation task, a classic example of an expensive data labeling process, we
do so pixel-wise. Our suggested approaches were applied to the MNIST dataset
as a proof of concept for a vision classification task and to the ADE20K
dataset in order to tackle the semi-supervised semantic segmentation problem.
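The confidence rule described in this abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the function name, the threshold value of 0.9, and the example probabilities are all hypothetical, and the abstract does not say how confidence is computed.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (index, label) pairs for predictions whose top probability meets the threshold."""
    selected = []
    for idx, probs in enumerate(probabilities):
        # Predicted label is the class with the highest probability.
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((idx, label))
    return selected


# Hypothetical class-probability outputs for four unlabeled samples over three classes.
probabilities = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.05, 0.90],
    [0.10, 0.85, 0.05],
]
pseudo = select_pseudo_labels(probabilities, threshold=0.9)
```

The selected pairs would then be added to the training set as if they were true labels. For semantic segmentation the same rule would be applied to each pixel's class probabilities rather than to each image.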
People on Drugs: Credibility of User Statements in Health Communities
Online health communities are a valuable source of information for patients
and physicians. However, such user-generated resources are often plagued by
inaccuracies and misinformation. In this work we propose a method for
automatically establishing the credibility of user-generated medical statements
and the trustworthiness of their authors by exploiting linguistic cues and
distant supervision from expert sources. To this end we introduce a
probabilistic graphical model that jointly learns user trustworthiness,
statement credibility, and language objectivity. We apply this methodology to
the task of extracting rare or unknown side-effects of medical drugs, which is
one of the problems where large-scale non-expert data has the potential
to complement expert medical knowledge. We show that our method can reliably
extract side-effects and filter out false statements, while identifying
trustworthy users that are likely to contribute valuable medical information.