A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
We present a novel hybrid algorithm for Bayesian network structure learning,
called H2PC. It first reconstructs the skeleton of a Bayesian network and then
performs a Bayesian-scoring greedy hill-climbing search to orient the edges.
The algorithm is based on divide-and-conquer constraint-based subroutines to
learn the local structure around a target variable. We conduct two series of
experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC),
currently regarded as the state-of-the-art algorithm for Bayesian network
structure learning. First, we use eight well-known Bayesian network benchmarks
with various data sizes to assess the quality of the learned structure returned
by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in
terms of goodness of fit to new data and quality of the network structure with
respect to the true dependence structure of the data. Second, we investigate
H2PC's ability to solve the multi-label learning problem. We provide
theoretical results to characterize and identify graphically the so-called
minimal label powersets that appear as irreducible factors in the joint
distribution under the faithfulness condition. The multi-label learning problem
is then decomposed into a series of multi-class classification problems, where
each multi-class variable encodes a label powerset. H2PC is shown to compare
favorably to MMHC in terms of global classification accuracy over ten
multi-label data sets covering different application domains. Overall, our
experiments support the conclusions that local structural learning with H2PC in
the form of local neighborhood induction is a theoretically well-motivated and
empirically effective learning framework that is well suited to multi-label
learning. The source code (in R) of H2PC as well as all data sets used for the
empirical tests are publicly available.
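To make the two-phase scheme concrete, here is a minimal Python sketch of a hybrid "skeleton first, then score-based orientation" learner in the spirit of H2PC and MMHC. It is an illustration, not the authors' implementation: the pairwise chi-squared skeleton phase stands in for H2PC's divide-and-conquer local constraint-based subroutines, the greedy search uses a plain BIC score and arc additions only (H2PC uses a Bayesian score and richer moves), and names such as `skeleton`, `bic`, and `orient` are our own.

```python
import numpy as np
from collections import defaultdict
from scipy.stats import chi2_contingency

def skeleton(data, alpha=0.05):
    """Phase 1: keep an undirected edge (i, j) when a chi-squared test
    rejects marginal independence (a crude stand-in for local tests)."""
    d, edges = data.shape[1], set()
    for i in range(d):
        for j in range(i + 1, d):
            table = np.zeros((int(data[:, i].max()) + 1,
                              int(data[:, j].max()) + 1))
            np.add.at(table, (data[:, i], data[:, j]), 1)
            _, p, _, _ = chi2_contingency(table + 1e-6)
            if p < alpha:
                edges.add((i, j))
    return edges

def bic(data, child, parents):
    """Decomposable BIC score of one node given its parent set
    (multinomial conditional probability table)."""
    n = data.shape[0]
    r = int(data[:, child].max()) + 1
    counts = defaultdict(lambda: np.zeros(r))
    for row in data:
        counts[tuple(row[p] for p in parents)][row[child]] += 1
    ll = sum((c[c > 0] * np.log(c[c > 0] / c.sum())).sum()
             for c in counts.values())
    return ll - 0.5 * np.log(n) * len(counts) * (r - 1)

def creates_cycle(arcs, x, y):
    """Check whether adding the arc x -> y would close a directed cycle."""
    stack, seen = [y], set()
    while stack:
        v = stack.pop()
        if v == x:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(w for (u, w) in arcs if u == v)
    return False

def orient(data, skel):
    """Phase 2: greedy hill-climbing over arc additions restricted to the
    skeleton, scored by the sum of local BIC terms."""
    arcs, parents = set(), defaultdict(list)
    improved = True
    while improved:
        improved, best = False, (1e-9, None)
        for (i, j) in skel:
            for (x, y) in ((i, j), (j, i)):
                if (x, y) in arcs or (y, x) in arcs or creates_cycle(arcs, x, y):
                    continue
                gain = bic(data, y, parents[y] + [x]) - bic(data, y, parents[y])
                if gain > best[0]:
                    best = (gain, (x, y))
        if best[1] is not None:
            x, y = best[1]
            arcs.add((x, y))
            parents[y].append(x)
            improved = True
    return arcs

# Toy demo: a noisy chain A -> B -> C over binary variables.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 2000)
b = np.where(rng.random(2000) < 0.9, a, 1 - a)
c = np.where(rng.random(2000) < 0.9, b, 1 - b)
data = np.column_stack([a, b, c])
print(orient(data, skeleton(data)))  # a DAG in the chain's equivalence class
```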
Domain Adaptation on Graphs by Learning Graph Topologies: Theoretical Analysis and an Algorithm
Traditional machine learning algorithms assume that the training and test
data have the same distribution, while this assumption does not necessarily
hold in real applications. Domain adaptation methods take into account the
deviations in the data distribution. In this work, we study the problem of
domain adaptation on graphs. We consider a source graph and a target graph
constructed with samples drawn from data manifolds. We study the problem of
estimating the unknown class labels on the target graph using the label
information on the source graph and the similarity between the two graphs. We
particularly focus on a setting where the target label function is learnt such
that its spectrum is similar to that of the source label function. We first
propose a theoretical analysis of domain adaptation on graphs and present
performance bounds that characterize the target classification error in terms
of the properties of the graphs and the data manifolds. We show that the
classification performance improves as the graph topologies become more
balanced, i.e., as the numbers of neighbors of different nodes become more
proportionate and weak (small-weight) edges are avoided. Our results also
suggest that, for good generalization, edges between overly distant data
samples should be avoided. We then propose a graph domain
adaptation algorithm inspired by our theoretical findings, which estimates the
label functions while learning the source and target graph topologies at the
same time. The joint graph learning and label estimation problem is formulated
through an objective function relying on our performance bounds, which is
minimized with an alternating optimization scheme. Experiments on synthetic and
real data sets suggest that the proposed method outperforms baseline
approaches.
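The following minimal sketch illustrates only the spectral side of this setting: it expands the source label function in the source graph's Laplacian eigenbasis and reuses the resulting spectrum on the target graph. It omits the paper's main contribution, the alternating optimization that jointly learns the two graph topologies; the name `transfer_labels` and the sign-alignment heuristic are our own assumptions.

```python
import numpy as np

def laplacian(W):
    """Combinatorial graph Laplacian L = D - W of a weight matrix W."""
    return np.diag(W.sum(axis=1)) - W

def transfer_labels(W_src, y_src, W_tgt, k=10):
    """Expand the source label function in the source Laplacian eigenbasis
    and reuse its first k spectral coefficients on the target graph."""
    _, U_s = np.linalg.eigh(laplacian(W_src))
    _, U_t = np.linalg.eigh(laplacian(W_tgt))
    # Eigenvectors are defined only up to sign; crude alignment via the
    # sign of the first entry of each eigenvector.
    U_s = U_s * np.sign(U_s[0] + 1e-12)
    U_t = U_t * np.sign(U_t[0] + 1e-12)
    alpha = U_s[:, :k].T @ y_src   # graph-Fourier spectrum on the source
    return U_t[:, :k] @ alpha      # same spectrum, target eigenbasis

# Toy usage: two noisy samplings of the same 1-D manifold, labels = sign(x).
rng = np.random.default_rng(0)
x_s, x_t = np.sort(rng.normal(size=60)), np.sort(rng.normal(size=60))
W_s = np.exp(-np.subtract.outer(x_s, x_s) ** 2); np.fill_diagonal(W_s, 0)
W_t = np.exp(-np.subtract.outer(x_t, x_t) ** 2); np.fill_diagonal(W_t, 0)
y_hat = transfer_labels(W_s, np.sign(x_s), W_t, k=5)
print((np.sign(y_hat) == np.sign(x_t)).mean())  # label agreement on target
```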
Asymptotic Analysis of Generative Semi-Supervised Learning
Semi-supervised learning has emerged as a popular framework for improving
modeling accuracy while controlling labeling cost. Based on an extension of
stochastic composite likelihood, we quantify the asymptotic accuracy of
generative semi-supervised learning. In doing so, we complement
distribution-free analysis by providing an alternative framework to measure the
value associated with different labeling policies and resolve the fundamental
question of how much data to label and in what manner. We demonstrate our
approach with both simulation studies and real world experiments using naive
Bayes for text classification and MRFs and CRFs for structured prediction in
NLP.
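As a concrete instance of the generative semi-supervised setting being analyzed (not the paper's composite-likelihood estimator), the sketch below fits a Bernoulli naive Bayes model by EM on a mix of labeled and unlabeled binary data; all names and the Laplace smoothing are illustrative choices.

```python
import numpy as np

def ssl_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_iter=20):
    """Bernoulli naive Bayes fit by EM on labeled plus unlabeled binary data."""
    R_lab = np.eye(n_classes)[y_lab]                           # hard labels
    R_unl = np.full((len(X_unl), n_classes), 1.0 / n_classes)  # soft labels
    X = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R = np.vstack([R_lab, R_unl])
        # M-step: class priors and Laplace-smoothed feature probabilities.
        pi = R.sum(axis=0) / len(X)
        theta = (R.T @ X + 1.0) / (R.sum(axis=0)[:, None] + 2.0)
        # E-step on unlabeled rows only: posterior class responsibilities.
        log_p = (np.log(pi)
                 + X_unl @ np.log(theta).T
                 + (1 - X_unl) @ np.log(1 - theta).T)
        log_p -= log_p.max(axis=1, keepdims=True)
        R_unl = np.exp(log_p)
        R_unl /= R_unl.sum(axis=1, keepdims=True)
    return pi, theta

# Toy demo: 20 labeled and 480 unlabeled samples from a 2-class model.
rng = np.random.default_rng(1)
true_theta = np.array([[0.8, 0.2, 0.7], [0.2, 0.9, 0.3]])
y = rng.integers(0, 2, 500)
X = (rng.random((500, 3)) < true_theta[y]).astype(float)
pi, theta = ssl_naive_bayes(X[:20], y[:20], X[20:], n_classes=2)
print(np.round(theta, 2))  # roughly recovers true_theta (up to label swap)
```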
Semi-supervised Predictive Clustering Trees for (Hierarchical) Multi-label Classification
Semi-supervised learning (SSL) is a common approach to learning predictive
models using not only labeled examples, but also unlabeled examples. While SSL
for the simple tasks of classification and regression has received a lot of
attention from the research community, it has not been properly investigated
for complex prediction tasks with structurally dependent variables. This is
the case for multi-label classification and hierarchical multi-label
classification, tasks that may benefit from additional information about the
underlying distribution in the descriptive space, as provided by unlabeled
examples, to better address the challenge of predicting multiple class labels
simultaneously.
In this paper, we investigate this aspect and propose a (hierarchical)
multi-label classification method based on semi-supervised learning of
predictive clustering trees. We also extend the method towards ensemble
learning and propose a method based on the random forest approach. Extensive
experimental evaluation conducted on 23 datasets shows significant advantages
of the proposed method and its ensemble extension over their supervised
counterparts. Moreover, the method preserves interpretability and reduces the
time complexity of classical tree-based models.
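The sketch below illustrates the core idea behind semi-supervised predictive clustering trees: the split heuristic mixes variance over the label space (computed on labeled rows only) with variance over the descriptive space (computed on all rows), so unlabeled examples influence the tree structure. The weight `w`, the exhaustive threshold search, and all helper names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def variance(M):
    """Mean per-column variance; zero for an empty matrix."""
    return float(M.var(axis=0).mean()) if len(M) else 0.0

def impurity(X, Y, labeled, w=0.5):
    """Weighted sum of label-space variance (labeled rows only) and
    descriptive-space variance (all rows), so unlabeled data counts too."""
    return w * variance(Y[labeled]) + (1.0 - w) * variance(X)

def best_split(X, Y, labeled, w=0.5):
    """Exhaustively pick the (feature, threshold) pair that most reduces
    the semi-supervised impurity, as in top-down tree induction."""
    parent = impurity(X, Y, labeled, w)
    best = (1e-12, None, None)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:
            left = X[:, f] <= t
            child = (left.mean() * impurity(X[left], Y[left],
                                            labeled[left], w)
                     + (1 - left.mean()) * impurity(X[~left], Y[~left],
                                                    labeled[~left], w))
            if parent - child > best[0]:
                best = (parent - child, f, t)
    return best  # (gain, feature index, threshold)

# Toy usage: 2 features, 3 binary labels, only a quarter of rows labeled.
rng = np.random.default_rng(2)
X = rng.random((200, 2))
Y = np.column_stack([X[:, 0] > 0.5, X[:, 1] > 0.3,
                     X[:, 0] > 0.7]).astype(float)
labeled = rng.random(200) < 0.25
print(best_split(X, Y, labeled))
```

In a full predictive clustering tree this split would be applied recursively until a stopping criterion holds; the random-forest extension would train many such trees on bootstrap replicates with random feature subsets.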