14 research outputs found

    Automated Construction of Relational Attributes ACORA: A Progress Report

    Data mining research has not only developed a large number of algorithms, but also enhanced our knowledge and understanding of their applicability and performance. However, the application of data mining technology in business environments is still not very common, despite the fact that organizations have access to large amounts of data and make decisions that could profit from data mining on a daily basis. One of the reasons is the mismatch between data representation for data storage and data analysis. Data are most commonly stored in multi-table relational databases, whereas data mining methods require that the data be represented as a simple feature vector. This work presents a general framework for feature construction from multiple relational tables for data mining applications. The second part describes our prototype implementation ACORA (Automated Construction of Relational Features). Information Systems Working Papers Series
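The mismatch this abstract describes, between multi-table storage and the single feature vector most learners expect, can be illustrated with a minimal sketch. It is not ACORA's method, which the abstract does not detail. It assumes a hypothetical `customers` target table and a hypothetical one-to-many `transactions` table, and it flattens the related rows with simple count, sum, and mean aggregates; the table names, column names, and choice of aggregates are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical target table: one row per customer (the entity to be modeled).
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
    {"customer_id": 3, "age": 27},
]

# Hypothetical related table: many rows per customer, linked by customer_id.
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]


def build_feature_vectors(targets, related, key, value):
    """Flatten a one-to-many relation into one fixed-length row per target."""
    # Group the related values by the join key.
    grouped = defaultdict(list)
    for row in related:
        grouped[row[key]].append(row[value])

    vectors = []
    for target in targets:
        values = grouped.get(target[key], [])
        vector = dict(target)  # keep the target's own attributes
        vector[f"{value}_count"] = len(values)
        vector[f"{value}_sum"] = sum(values)
        # Assumption: targets with no related rows get a mean of 0.0.
        vector[f"{value}_mean"] = mean(values) if values else 0.0
        vectors.append(vector)
    return vectors


features = build_feature_vectors(customers, transactions, "customer_id", "amount")
for row in features:
    print(row)
```

Each resulting row has the same set of columns regardless of how many transactions a customer has, which is the property a single-table learner needs.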

    A MODULAR APPROACH TO RELATIONAL DATA MINING

    Multi-relational data mining

    An important aspect of data mining algorithms and systems is that they should scale well to large databases. A consequence of this is that most data mining tools are based on machine learning algorithms that work on data in attribute-value format. Experience has proven that such 'single-table' mining algorithms indeed scale well. The downside of this format is, however, that more complex patterns are simply not expressible in it and, thus, cannot be discovered. One way to enlarge the expressiveness is to generalize, as in ILP, from one-table mining to multiple-table mining, i.e., to support mining on full relational databases. The key step in such a generalization is to ensure that the search space does not explode and that efficiency and, thus, scalability are maintained. In this paper we present a framework and an architecture that provide such a generalization. In this framework the semantic information in the database schema, e.g., foreign keys, is exploited to prune the search space and, in the architecture, database primitives are defined to ensure efficiency. Moreover, the framework induces a canonical generalization of algorithms, i.e., if the generalized algorithms are run on a single-table database, they give the same results as their single-table counterparts. The framework is illustrated by the Warmr algorithm, which is a multi-relational generalization of the Apriori algorithm.

    MRDTL: a multi-relational decision tree learning algorithm

    Many real-world data sets are organized in relational databases consisting of multiple tables and associations. Other types of data, such as those in bioinformatics, computational biology, and HTML and XML documents, require reasoning about the structure of the objects. However, most existing approaches to machine learning assume that the data are stored in a single table, and use a propositional (as opposed to relational) language for discovering predictive models. Hence, there is a need for data mining algorithms for discovery of a priori unknown relationships from multi-relational data. This thesis explores a new framework for multi-relational data mining. It describes experiments with an implementation of a Multi-Relational Decision Tree Learning (MRDTL) algorithm for induction of decision trees from relational databases based on an approach suggested by Knobbe et al., 1999. Our experiments with widely used benchmark data sets (e.g., the carcinogenesis data) show that the performance of MRDTL is competitive with that of other algorithms for learning classifiers from multiple relations, including Progol (Muggleton, 1995), FOIL (Quinlan, 1993), and Tilde (Blockeel, 1998). Preliminary results indicate that MRDTL, when augmented with principled methods for handling missing attribute values, is likely to be competitive with the state-of-the-art algorithms for learning classifiers from multiple relations on real-world data sets drawn from bioinformatics applications (prediction of gene localization and gene function) used in the KDD Cup 2001 data mining competition (Cheng et al., 2002).

    Inducción de conocimiento con incertidumbre en bases de datos relacionales borrosas (Induction of Knowledge with Uncertainty in Fuzzy Relational Databases)

    This work presents a system for learning logical definitions with uncertainty from a fuzzy relational database. The field of interest is therefore inductive logic programming, to which it makes several contributions, mainly concerning the input data and the results produced. The input data belong to a fuzzy relational database. They are therefore expressed as tables of tuples (relations), in which each tuple may carry a degree of membership in the corresponding relation. These are thus fuzzy relations, directly identifiable with fuzzy concepts (so common in reality as seen from a human point of view), and not ordinary relations with fuzzy attributes (which is how "fuzziness" is understood in many existing systems). The output data are expressed as logical definitions of a relation (ordinary or fuzzy), each consisting of one Horn clause or the disjunction of several. These Horn clauses are built from literals, applied (generally) to variables and associated with fuzzy or ordinary relations. Fuzzy literals can additionally be modified by linguistic labels. These definitions thus combine predicate logic with fuzzy logic in what may be called "fuzzy predicate logic", which constitutes a contribution to the automatic induction of knowledge. In addition, the induced definitions carry an associated uncertainty factor, as in other existing systems. The starting point of the work is a well-known system for inducing logical definitions: FOIL, created by Quinlan in 1990 and based on predicate logic. On top of this initial system, besides the fuzzy logic extensions already mentioned, a further series of modifications and extensions is made, aimed at improving knowledge induction. These improvements are made mainly in its heuristic component, by defining a literal evaluation function, based on interestingness measures, that corrects some deficiencies of the original system and increases the quality of the induced rules. Other modifications are aimed at introducing background knowledge, through intensionally defined relations, in a way similar to other systems such as FOCL. As a tangible result of the thesis, a system, FZFOIL, has been developed and tested; it is publicly available under the GNU license.
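The idea of fuzzy relations whose tuples carry a membership degree, combined through Horn clauses, can be sketched as follows. This is an illustration only and not FZFOIL's implementation: the abstract does not say how conjunctions are evaluated, so the sketch assumes the common choice of `min` for the conjunction of body literals and `max` over the bindings of the free variable. The relations `tall` and `friend`, the clause, and all names are hypothetical.

```python
# A fuzzy relation is modeled as a dict mapping each tuple to its
# membership degree in [0, 1]; tuples not listed have degree 0.0.
tall = {("ann",): 0.9, ("bob",): 0.4}
friend = {("ann", "bob"): 0.7, ("bob", "ann"): 0.7, ("bob", "cat"): 1.0}

people = ["ann", "bob", "cat"]


def degree(relation, tup):
    """Membership degree of a tuple in a fuzzy relation."""
    return relation.get(tup, 0.0)


def has_tall_friend(x):
    """Degree of the hypothetical clause
        has_tall_friend(X) :- friend(X, Y), tall(Y).
    Assumption: min for the conjunction, max over the bindings of Y.
    """
    return max(
        (min(degree(friend, (x, y)), degree(tall, (y,))) for y in people),
        default=0.0,
    )


for person in people:
    print(person, has_tall_friend(person))
```

An ordinary (crisp) relation is the special case in which every listed tuple has degree 1.0, so the same evaluation covers clauses that mix fuzzy and ordinary relations.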

    Data Mining zur Unterstützung betrieblicher Entscheidungsprozesse (Data Mining in Support of Business Decision Processes)

    Data mining is known as the application of algorithms to identify data patterns in large data sets. This dissertation extends the discussion of data mining methods, which in the literature is mostly conducted in purely technical terms, to their potential business applications. It examines how data mining methods can support business decision processes. First, a formal 'toolkit' for developing new data mining methods is introduced, covering the design options for data mining model types and search methods as well as the assessment of the interestingness of economic models. From an examination of business data mining applications, a general scheme for supporting decision processes through data mining is derived. The decision model as a model type is examined more closely, and a data mining method for generating decision models is developed. Finally, the method is evaluated on test data and applied to the problem of selecting customers for a direct marketing campaign in the insurance market.