9 research outputs found

    Inductive queries for a drug designing robot scientist

    It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it; and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments.

    MRDTL: a multi-relational decision tree learning algorithm

    Many real-world data sets are organized in relational databases consisting of multiple tables and associations. Other types of data, such as those in bioinformatics and computational biology, and HTML and XML documents, require reasoning about the structure of the objects. However, most existing approaches to machine learning assume that the data are stored in a single table, and use a propositional (as opposed to relational) language for discovering predictive models. Hence, there is a need for data mining algorithms for discovery of a priori unknown relationships from multi-relational data. This thesis explores a new framework for multi-relational data mining. It describes experiments with an implementation of a Multi-Relational Decision Tree Learning (MRDTL) algorithm for induction of decision trees from relational databases, based on an approach suggested by Knobbe et al., 1999. Our experiments with widely used benchmark data sets (e.g., the carcinogenesis data) show that the performance of MRDTL is competitive with that of other algorithms for learning classifiers from multiple relations, including Progol (Muggleton, 1995), FOIL (Quinlan, 1993), and Tilde (Blockeel, 1998). Preliminary results indicate that MRDTL, when augmented with principled methods for handling missing attribute values, is likely to be competitive with the state-of-the-art algorithms for learning classifiers from multiple relations on real-world data sets drawn from bioinformatics applications (prediction of gene localization and gene function) used in the KDD Cup 2001 data mining competition (Cheng et al., 2002).

    OWL-Miner: Concept Induction in OWL Knowledge Bases

    The Resource Description Framework (RDF) and Web Ontology Language (OWL) have been widely used in recent years, and automated methods for the analysis of data and knowledge directly within these formalisms are of current interest. Concept induction is a technique for discovering descriptions of data, such as inducing OWL class expressions to describe RDF data. These class expressions capture patterns in the data which can be used to characterise interesting clusters or to act as classification rules over unseen data. The semantics of OWL is underpinned by Description Logics (DLs), a family of expressive and decidable fragments of first-order logic. Recently, methods of concept induction which are well studied in the field of Inductive Logic Programming have been applied to the related formalism of DLs. These methods have been developed for a number of purposes including unsupervised clustering and supervised classification. Refinement-based search is a concept induction technique which structures the search space of DL concept/OWL class expressions and progressively generalises or specialises candidate concepts to cover example data as guided by quality criteria such as accuracy. However, the current state-of-the-art in this area is limited in that such methods: were not primarily designed to scale over large RDF/OWL knowledge bases; do not support class languages as expressive as OWL2-DL; or, are limited to one purpose, such as learning OWL classes for integration into ontologies. Our work addresses these limitations by increasing the efficiency of these learning methods whilst permitting a concept language up to the expressivity of OWL2-DL classes. We describe methods which support both classification (predictive induction) and subgroup discovery (descriptive induction), which, in this context, are fundamentally related.
    We have implemented our methods as the system called OWL-Miner and show by evaluation that our methods outperform state-of-the-art systems for DL learning in both the quality of solutions found and the speed with which they are computed. Furthermore, we achieve the best ever ten-fold cross validation accuracy results on the long-standing benchmark problem of carcinogenesis. Finally, we present a case study on ongoing work in the application of OWL-Miner to a real-world problem directed at improving the efficiency of biological macromolecular crystallisation.

    Feature construction with version spaces for biochemical applications
