A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structures. Our extensive experiments show that H2PC outperforms MMHC both in goodness of fit to new data and in quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results that characterize and graphically identify the so-called minimal label powersets, which appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusion that local structure learning with H2PC, in the form of local neighborhood induction, is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC and all data sets used in the empirical tests are publicly available.
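To make the label powerset decomposition concrete, here is a minimal sketch in Python (the authors' implementation is in R, and the paper identifies minimal label powersets from the learned graph; in this sketch all labels are treated as a single factor and an off-the-shelf classifier stands in for the base learner):

```python
# Minimal sketch of the label powerset (LP) transformation: each distinct
# combination of labels becomes one class of a multi-class problem.
# Illustrative only -- not the paper's R implementation, and all labels
# are treated here as one factor rather than as minimal powersets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_label_powerset(X, Y):
    """X: (n, d) feature matrix; Y: (n, q) binary label matrix."""
    # Map each observed label combination to a single class id.
    combos, y = np.unique(Y, axis=0, return_inverse=True)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    return clf, combos

def predict_labels(clf, combos, X):
    # Decode predicted class ids back into full label vectors.
    return combos[clf.predict(X)]

# Toy usage: three labels, where labels 0 and 1 always co-occur.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
Y = np.zeros((200, 3), dtype=int)
Y[X[:, 0] > 0.5, 0] = 1
Y[X[:, 0] > 0.5, 1] = 1
Y[X[:, 1] > 0.5, 2] = 1
clf, combos = fit_label_powerset(X, Y)
print(predict_labels(clf, combos, X[:5]))
```

Each distinct combination of labels observed in the training data becomes one class of a multi-class problem, so predicting a class id and decoding it recovers an entire label vector at once.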
Making decision trees feasible in ultrahigh feature and label dimensions
Owing to their non-linear yet highly interpretable representations, decision tree (DT) models have attracted significant attention from researchers. However, DT models are difficult to understand and interpret in ultrahigh dimensions, and they typically suffer from the curse of dimensionality, achieving degraded performance when many noisy features are present. To address these issues, this paper first presents a novel data-dependent generalization error bound for the perceptron decision tree (PDT), which provides the theoretical justification for learning a sparse linear hyperplane in each decision node and for pruning the tree. Following our analysis, we introduce the notion of a budget-aware classifier (BAC) with a budget constraint on the weight coefficients, and propose a supervised budgeted tree (SBT) algorithm to achieve non-linear prediction performance. To avoid generating an unstable and overly complicated decision tree and to improve the generalization of SBT, we present a pruning strategy that learns classifiers to minimize cross-validation error at each BAC. To deal with ultrahigh label dimensions, based on three important phenomena observed in real-world data sets from a variety of application domains, we develop a sparse coding tree framework for multi-label annotation problems and provide its theoretical analysis. Extensive empirical studies verify that 1) SBT is easy to understand and interpret in ultrahigh dimensions and is more resilient to noisy features, and 2) compared with state-of-the-art algorithms, our proposed sparse coding tree framework is more efficient, yet accurate, in ultrahigh label and feature dimensions.
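As a hedged illustration of the budget-aware classifier idea, the sketch below learns a sparse linear hyperplane for a single decision node under a budget on the number of non-zero weight coefficients; L1-penalized logistic regression is a stand-in for the paper's exact formulation and optimizer:

```python
# Hedged sketch of one decision node of a budgeted tree: a sparse linear
# hyperplane learned under a budget on the non-zero weight coefficients.
# L1-penalized logistic regression is a stand-in for the paper's
# budget-aware classifier (BAC); SBT's exact constraint and optimizer differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_node(X, y, sparsity_budget=5, C_grid=(0.01, 0.1, 1.0, 10.0)):
    """Return the least-regularized L1 model whose number of non-zero
    weights still respects the budget (C_grid is scanned from strongest
    to weakest penalty)."""
    best = None
    for C in C_grid:
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X, y)
        if np.count_nonzero(clf.coef_) <= sparsity_budget:
            best = clf
    return best

def split(node_clf, X):
    # Route samples to the left child when the decision function is <= 0.
    return node_clf.decision_function(X) <= 0.0

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))            # 50 features, most of them noise
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
node = fit_node(X, y)
print("non-zero weights:", np.count_nonzero(node.coef_))
print("samples routed left:", int(split(node, X).sum()))
```

A full tree would recurse on the two routed subsets; the sparsity of each node's hyperplane is what keeps the model interpretable in high dimensions.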
Sparse Predictive Modeling: A Cost-Effective Perspective
Many real-life problems encountered in industry, economics, or engineering are complex and difficult to model with conventional mathematical methods. Machine learning provides a wide variety of methods and tools for solving such problems by learning mathematical models from data. Methods from the field have found their way into applications such as medical diagnosis, financial forecasting, and web-search engines. The predictions made by a learned model are based on a vector of feature values describing the input. However, predictions do not come for free in real-world applications, since the feature values of the input have to be bought, measured, or produced before the model can be used. Feature selection is the process of eliminating irrelevant and redundant features from the model. Traditionally, it has been applied to obtain interpretable and more accurate models, while its potential for lowering prediction costs has received much less attention in the literature.
In this thesis we consider novel feature selection techniques for reducing prediction costs. The contributions are as follows. First, we propose several cost types characterizing the cost of performing prediction with a trained model. In particular, we consider costs emerging from multi-target prediction problems, as well as a number of cost types arising when the feature extraction process is structured. Second, we develop greedy regularized least-squares methods that maximize the predictive performance of the models under given budget constraints. Empirical evaluations are performed on numerous benchmark data sets as well as on a novel water quality analysis application. The results demonstrate that, in settings where the considered cost types apply, the proposed methods lead to substantial cost savings compared to conventional methods.
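The following is a hedged sketch of one such method: greedy forward selection with regularized least squares (RLS) under a prediction-cost budget. The cost vector, the ridge solver, and the stopping rule are illustrative assumptions; the thesis develops considerably more efficient update rules than the from-scratch refits used here:

```python
# Hedged sketch of cost-constrained greedy forward selection with
# regularized least squares (RLS). Each candidate feature is scored by
# refitting ridge regression from scratch, which is simple but slow;
# the thesis derives far more efficient incremental updates.
import numpy as np

def rls_error(X, y, lam=1.0):
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return float(np.mean((X @ w - y) ** 2))

def greedy_budgeted_selection(X, y, costs, budget, lam=1.0):
    """costs[j] = price of acquiring feature j at prediction time; stop
    when no affordable feature improves the training error."""
    selected, spent, best_err = [], 0.0, np.inf
    improved = True
    while improved:
        improved, best_j = False, None
        for j in range(X.shape[1]):
            if j in selected or spent + costs[j] > budget:
                continue
            err = rls_error(X[:, selected + [j]], y, lam)
            if err < best_err:
                best_err, best_j, improved = err, j, True
        if improved:
            selected.append(best_j)
            spent += costs[best_j]
    return selected, spent, best_err

# Toy usage: the informative feature 0 is three times as expensive.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=200)
costs = np.ones(10)
costs[0] = 3.0
print(greedy_budgeted_selection(X, y, costs, budget=4.0))
```

The budget constraint makes the trade-off explicit: an expensive but informative feature is selected only if no affordable combination achieves a comparable fit.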
Automatic refinement of large-scale cross-domain knowledge graphs
Knowledge graphs are a way to represent complex structured and unstructured information integrated into an ontology, with which one can reason about the existing information to deduce new information or highlight inconsistencies. Knowledge graphs are divided into the terminology box (TBox), also known as the ontology, and the assertion box (ABox). The former consists of a set of schema axioms defining the classes and properties that describe the data domain, whereas the ABox consists of a set of facts describing instances in terms of the TBox vocabulary.
In recent years, there have been several initiatives for creating large-scale cross-domain knowledge graphs, both free and commercial, with DBpedia, YAGO, and Wikidata being amongst the most successful free datasets. Those graphs are often constructed by extracting information from semi-structured sources, such as Wikipedia, or from unstructured web text using NLP methods. It is therefore unlikely, in particular when heuristic methods are applied and unreliable sources are used, that the resulting knowledge graph is fully correct or complete. There is a trade-off between completeness and correctness, which each knowledge graph's construction approach addresses differently.
Knowledge graphs have a wide variety of applications, e.g. semantic search and discovery, question answering, recommender systems, expert systems, and personal assistants, and the quality of a knowledge graph is crucial for these applications. In order to further increase the quality of such large-scale knowledge graphs, various automatic refinement methods have been proposed. Those methods try to infer and add missing knowledge to the graph, or to detect erroneous pieces of information. In this thesis, we investigate the problem of automatic knowledge graph refinement and propose methods that address it from two directions: automatic refinement of the TBox and of the ABox.
In Part I we address the ABox refinement problem. We propose a method for predicting missing type assertions using hierarchical multi-label classifiers with ingoing/outgoing links as features. We also present an approach to detecting relation assertion errors that exploits type and path patterns in the graph. Moreover, we propose an approach to correcting relation errors that originate from confusions between entities. Also in the ABox refinement direction, we propose a knowledge graph model and a process for synthesizing knowledge graphs for benchmarking ABox completion methods.
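As a hedged illustration of the first of these methods, the sketch below predicts entity types from counts of ingoing/outgoing properties with a flat one-vs-rest multi-label classifier; the toy triples and dbo: names are invented, and the thesis uses hierarchical rather than flat multi-label classifiers:

```python
# Hedged sketch of ABox type prediction: each entity is represented by
# counts of its ingoing/outgoing properties, and a flat one-vs-rest
# multi-label classifier predicts its types. Triples and dbo: names are
# invented; the thesis uses hierarchical multi-label classifiers.
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

triples = [  # (subject, property, object)
    ("Berlin", "dbo:country", "Germany"),
    ("Germany", "dbo:capital", "Berlin"),
    ("Einstein", "dbo:birthPlace", "Ulm"),
    ("Ulm", "dbo:country", "Germany"),
]

def link_features(entity):
    # Count each property once per ingoing and outgoing occurrence.
    feats = Counter()
    for s, p, o in triples:
        if s == entity:
            feats["out:" + p] += 1
        if o == entity:
            feats["in:" + p] += 1
    return feats

train = {"Berlin": ["dbo:City"], "Ulm": ["dbo:City"],
         "Germany": ["dbo:Country"], "Einstein": ["dbo:Person"]}
vec, mlb = DictVectorizer(), MultiLabelBinarizer()
X = vec.fit_transform([link_features(e) for e in train])
Y = mlb.fit_transform(train.values())
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# Predict types for an entity from its link profile (toy-sized data,
# so treat the output as illustrative).
pred = clf.predict(vec.transform([link_features("Berlin")]))
print(mlb.inverse_transform(pred))
```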
In Part II we address the TBox refinement problem. We propose methods for inducing flexible relation constraints from the ABox, expressed using SHACL. We introduce an ILP refinement step that exploits correlations between numerical attributes and relations in order to efficiently learn Horn rules with numerical attributes. Finally, we investigate the introduction of lexical information from textual corpora into the ILP algorithm in order to improve the quality of induced class expressions.
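To illustrate what a SHACL-expressed relation constraint looks like in practice, here is a toy range-style constraint validated with rdflib and pySHACL; the shape and the data are invented for illustration and are not constraints induced by the thesis:

```python
# Illustrative only: a range-style relation constraint expressed in SHACL
# and validated with rdflib + pySHACL. Shape and data are toy examples,
# not constraints induced by the methods above.
from rdflib import Graph
from pyshacl import validate

data = Graph().parse(format="turtle", data="""
@prefix ex: <http://example.org/> .
ex:Berlin a ex:City ;
    ex:country ex:Einstein .        # wrong: the object is a Person
ex:Einstein a ex:Person .
""")

shapes = Graph().parse(format="turtle", data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:CountryRangeShape a sh:NodeShape ;
    sh:targetSubjectsOf ex:country ;
    sh:property [
        sh:path ex:country ;
        sh:class ex:Country         # objects of ex:country must be Countries
    ] .
""")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)                     # False: ex:Einstein is not an ex:Country
print(report)
```

Constraints of this kind turn implicit regularities in the ABox into checkable rules, so violations surface as validation reports rather than silent errors.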
Data Mining
The availability of big data due to computerization and automation has generated an urgent need for new techniques to analyze and convert big data into useful information and knowledge. Data mining is a promising and leading-edge technology for mining large volumes of data, uncovering hidden information, and aiding knowledge discovery. It can be used for characterization, classification, discrimination, anomaly detection, association, clustering, trend or evolution prediction, and much more in fields such as science, medicine, economics, engineering, computing, and business analytics. This book presents basic concepts, ideas, and research in data mining.
An Empirical Evaluation of Constrained Feature Selection
While feature selection helps to obtain smaller and more understandable prediction models, most existing feature-selection techniques do not consider domain knowledge. One way to use domain knowledge is via constraints on the sets of selected features. However, the impact of such constraints, e.g., on the predictive quality of the selected features, is currently unclear. This article is an empirical study that evaluates the impact of propositional and arithmetic constraints on filter feature selection. First, we systematically generate constraints of various types, using datasets from different domains. As expected, constraints tend to decrease the predictive quality of feature sets, but this effect is non-linear, so we also observe feature sets that both adhere to the constraints and have high predictive quality. Second, we study a concrete setting in materials science. This part of our study sheds light on how one can analyze scientific hypotheses with the help of constraints.
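As a hedged sketch of how such constraints can interact with filter feature selection, the snippet below ranks features by mutual information and adds them greedily only while a propositional constraint remains satisfied; the "at most one of features 0-2" constraint and the greedy scheme are illustrative assumptions, not the article's exact protocol:

```python
# Hedged sketch of constraint-aware filter feature selection: rank
# features by mutual information, then add them greedily only while the
# propositional constraint still holds. The "at most one of features
# 0-2" constraint is a made-up example of the kind evaluated above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

def at_most_one_of(group):
    # Propositional constraint: the selected set may contain at most
    # one feature from the given group.
    return lambda selected: len(selected & group) <= 1

def constrained_filter_select(X, y, k, constraints):
    scores = mutual_info_classif(X, y, random_state=0)
    selected = set()
    for j in np.argsort(scores)[::-1]:       # best-scoring features first
        candidate = selected | {int(j)}
        if all(c(candidate) for c in constraints):
            selected = candidate
        if len(selected) == k:
            break
    return sorted(selected)

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
print(constrained_filter_select(X, y, k=4,
                                constraints=[at_most_one_of({0, 1, 2})]))
```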