Inductive queries for a drug designing robot scientist
It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it; and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments.
MRDTL: a multi-relational decision tree learning algorithm
Many real-world data sets are organized in relational databases consisting of multiple tables and associations. Other types of data, such as those in bioinformatics and computational biology or HTML and XML documents, require reasoning about the structure of the objects. However, most existing approaches to machine learning assume that the data are stored in a single table, and use a propositional (as opposed to relational) language for discovering predictive models. Hence, there is a need for data mining algorithms for the discovery of a priori unknown relationships from multi-relational data. This thesis explores a new framework for multi-relational data mining. It describes experiments with an implementation of a Multi-Relational Decision Tree Learning (MRDTL) algorithm for induction of decision trees from relational databases, based on an approach suggested by Knobbe et al. (1999). Our experiments with widely used benchmark data sets (e.g., the carcinogenesis data) show that the performance of MRDTL is competitive with that of other algorithms for learning classifiers from multiple relations, including Progol (Muggleton, 1995), FOIL (Quinlan, 1993), and Tilde (Blockeel, 1998). Preliminary results indicate that MRDTL, when augmented with principled methods for handling missing attribute values, is likely to be competitive with the state-of-the-art algorithms for learning classifiers from multiple relations on real-world data sets drawn from bioinformatics applications (prediction of gene localization and gene function) used in the KDD Cup 2001 data mining competition (Cheng et al., 2002).
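To make the contrast between propositional and relational learning concrete, the following is a minimal sketch, not the MRDTL algorithm itself. It assumes a hypothetical toy database with a target table (`molecules`) and a one-to-many related table (`atoms`), and scores a decision-tree split of the form "the molecule has at least one related atom of element X" by information gain. All table names, values, and function names are illustrative assumptions.

```python
from collections import Counter
from math import log2

# Hypothetical toy data: a target table and a one-to-many related table.
molecules = {"m1": "active", "m2": "active", "m3": "inactive", "m4": "inactive"}
atoms = [
    ("m1", "Cl"), ("m1", "C"),
    ("m2", "Cl"), ("m2", "O"),
    ("m3", "C"), ("m3", "O"),
    ("m4", "C"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def has_related(mol_id, element):
    """Existential test: does the molecule have at least one atom of this element?"""
    return any(m == mol_id and e == element for m, e in atoms)

def information_gain(element):
    """Gain from splitting the molecules on the existential test for `element`."""
    ids = list(molecules)
    yes = [molecules[i] for i in ids if has_related(i, element)]
    no = [molecules[i] for i in ids if not has_related(i, element)]
    all_labels = [molecules[i] for i in ids]
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(ids)
    return entropy(all_labels) - weighted

def best_element():
    """Pick the element whose existential test gives the highest gain."""
    elements = sorted({e for _, e in atoms})
    return max(elements, key=information_gain)
```

The split is evaluated against the related table directly, so the one-to-many structure is never flattened into a single row per molecule, which is the limitation of single-table learners that the abstract points to.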
OWL-Miner: Concept Induction in OWL Knowledge Bases
The Resource Description Framework (RDF) and Web Ontology Language (OWL) have been widely used in recent years, and automated methods for the analysis of data and knowledge directly within these formalisms are of current interest. Concept induction is a technique for discovering descriptions of data, such as inducing OWL class expressions to describe RDF data. These class expressions capture patterns in the data which can be used to characterise interesting clusters or to act as classification rules over unseen data. The semantics of OWL is underpinned by Description Logics (DLs), a family of expressive and decidable fragments of first-order logic. Recently, methods of concept induction which are well studied in the field of Inductive Logic Programming have been applied to the related formalism of DLs. These methods have been developed for a number of purposes including unsupervised clustering and supervised classification. Refinement-based search is a concept induction technique which structures the search space of DL concept/OWL class expressions and progressively generalises or specialises candidate concepts to cover example data as guided by quality criteria such as accuracy. However, the current state-of-the-art in this area is limited in that such methods: were not primarily designed to scale over large RDF/OWL knowledge bases; do not support class languages as expressive as OWL2-DL; or, are limited to one purpose, such as learning OWL classes for integration into ontologies. Our work addresses these limitations by increasing the efficiency of these learning methods whilst permitting a concept language up to the expressivity of OWL2-DL classes. We describe methods which support both classification (predictive induction) and subgroup discovery (descriptive induction), which, in this context, are fundamentally related. We have implemented our methods as the system called OWL-Miner and show by evaluation that our methods outperform state-of-the-art systems for DL learning in both the quality of solutions found and the speed in which they are computed. Furthermore, we achieve the best ever ten-fold cross validation accuracy results on the long-standing benchmark problem of carcinogenesis. Finally, we present a case study on ongoing work in the application of OWL-Miner to a real-world problem directed at improving the efficiency of biological macromolecular crystallisation.
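The idea of refinement-based search described in this abstract can be illustrated with a heavily simplified sketch. It is not OWL-Miner and performs no DL reasoning: as an assumption for illustration, a "concept" is just a conjunction of atomic properties, the refinement operator specialises a concept by adding one property, and a greedy loop keeps the refinement with the best accuracy. The individuals, property names, and function names are all hypothetical.

```python
# Hypothetical toy individuals, each described by a set of atomic properties.
individuals = {
    "a": {"Aromatic", "Halogenated"},
    "b": {"Aromatic", "Halogenated", "Nitro"},
    "c": {"Aromatic"},
    "d": {"Nitro"},
}
positives = {"a", "b"}
negatives = {"c", "d"}
VOCABULARY = sorted({p for props in individuals.values() for p in props})

def covers(concept, name):
    """A conjunction covers an individual that has every required property."""
    return concept <= individuals[name]

def accuracy(concept):
    """Fraction of labelled examples that the concept classifies correctly."""
    tp = sum(covers(concept, n) for n in positives)
    tn = sum(not covers(concept, n) for n in negatives)
    return (tp + tn) / (len(positives) + len(negatives))

def refine(concept):
    """Downward refinement: specialise by adding one more required property."""
    return [concept | {p} for p in VOCABULARY if p not in concept]

def search(max_steps=10):
    """Greedy top-down search starting from the most general (empty) concept."""
    best = frozenset()
    for _ in range(max_steps):
        candidates = refine(best)
        if not candidates:
            break
        challenger = max(candidates, key=accuracy)
        if accuracy(challenger) <= accuracy(best):
            break  # no refinement improves the quality criterion
        best = challenger
    return best
```

On this toy data, `search()` starts from the empty conjunction, which covers every individual, and specialises until accuracy stops improving. Practical systems typically explore the refinement space with a heuristic search over many candidates rather than a single greedy path, so this shows only the generate-and-score structure.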