8 research outputs found

    Scalable mining for classification rules in relational databases

    doi:10.1214/lnms/1196285404
    Data mining is a process of discovering useful patterns (knowledge) hidden in extremely large datasets. Classification is a fundamental data mining function, and some other functions can be reduced to it. In this paper we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm. We have built a prototype of MIND in the relational database management system DB2 and have benchmarked its performance. We describe the working prototype and report the measured performance with respect to the previous method of choice. MIND scales not only with the size of datasets but also with the number of processors on an IBM SP2 computer system. Even on uniprocessors, MIND scales well beyond dataset sizes previously published for classifiers. We also give some insights that may have an impact on the evolution of the extended relational calculus SQL.

    Three-Step Method for Tightly Integrating Data Mining Tasks into a Relational Database System (Método Tres-Pasos para integrar fuertemente tareas de minería de datos en un sistema de base de datos relacional)

    This paper presents a result of a research project that aimed to define new algebraic operators and new SQL primitives for knowledge discovery in an architecture tightly coupled with a Relational Database Management System (RDBMS). To facilitate tight coupling and to support data mining tasks inside the RDBMS engine, a three-step approach is proposed. In the first step, relational algebra is extended with new algebraic operators that support the most computationally expensive processes of data mining tasks. In the second step, so that the SQL language is relationally complete, these operators are defined as new primitives in the SELECT clause. In the last step, these primitives are unified into a new SQL operator that runs a specific data mining task. Applying this method, new algebraic operators, new SQL primitives, and new SQL operators for association and classification tasks were defined and implemented in the PostgreSQL DBMS engine, giving it the capacity to discover association and classification rules efficiently.

    Decision Tables: Scalable Classification Exploring RDBMS Capabilities

    In this paper, we report our success in building efficient, scalable classifiers in the form of decision tables by exploring the capabilities of modern relational database management systems. In addition to high classification accuracy, the unique features of the approach include its high training speed, linear scalability, and simplicity of implementation. More importantly, the major computation required by the approach can be implemented using standard functions provided by modern relational DBMSs. This not only makes implementation of the classifier extremely easy; further performance improvement can also be expected when better processing strategies for those computations are developed and implemented in RDBMSs. The novel classification approach, based on grouping and counting, and its implementation on top of an RDBMS are described. The result

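    Model for Identifying Relationships between Information on Graduates of the Master's and Doctoral Programs of the Universidad Nacional de Colombia and Their Time to Degree (Modelo para la identificación de relaciones entre la información sobre los graduados de los programas de Maestría y Doctorado de la Universidad Nacional de Colombia y su tiempo de permanencia)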

    In this research, a knowledge discovery in databases (KDD) process was applied to educational data, specifically to academic and non-academic data on graduates of the graduate programs (master's, specialization, and doctoral) at all campuses of the Universidad Nacional de Colombia. The focus was on the data mining stage, with two main goals: first, to look for relationships between the graduates' academic and non-academic data and the time it took them to graduate; and second, to develop a model that could give a probability of what may happen to students who are just entering these graduate programs. Finally, a visualization of the results, covering both the relational patterns and the predictions, was implemented in the Auto-Evaluation System of the Graduate Programs at the Universidad Nacional de Colombia, so that programs can consult it and make decisions suited to their own situation when preparing their improvement plans.
    Maestrí

    Data mining and database systems: integrating conceptual clustering with a relational database management system.

    Many clustering algorithms have been developed and improved over the years to cater for large scale data clustering. However, much of this work has been in developing numeric based algorithms that use efficient summarisations to scale to large data sets. There is a growing need for scalable categorical clustering algorithms as, although numeric based algorithms can be adapted to categorical data, they do not always produce good results. This thesis presents a categorical conceptual clustering algorithm that can scale to large data sets using appropriate data summarisations. Data mining is distinguished from machine learning by the use of larger data sets that are often stored in database management systems (DBMSs). Many clustering algorithms require data to be extracted from the DBMS and reformatted for input to the algorithm. This thesis presents an approach that integrates conceptual clustering with a DBMS. The presented approach makes the algorithm main memory independent and supports on-line data mining

    Data mining and database systems : integrating conceptual clustering with a relational database management system

    Many clustering algorithms have been developed and improved over the years to cater for large scale data clustering. However, much of this work has been in developing numeric based algorithms that use efficient summarisations to scale to large data sets. There is a growing need for scalable categorical clustering algorithms as, although numeric based algorithms can be adapted to categorical data, they do not always produce good results. This thesis presents a categorical conceptual clustering algorithm that can scale to large data sets using appropriate data summarisations. Data mining is distinguished from machine learning by the use of larger data sets that are often stored in database management systems (DBMSs). Many clustering algorithms require data to be extracted from the DBMS and reformatted for input to the algorithm. This thesis presents an approach that integrates conceptual clustering with a DBMS. The presented approach makes the algorithm main memory independent and supports on-line data mining.
    EThOS - Electronic Theses Online Service; GB; United Kingdo

    Scalable mining for classification rules in relational databases

    No full text

    Scalable Mining for Classification Rules in Relational Databases

    No full text
    Classification is a key function of many "business intelligence" toolkits and a fundamental building block in data mining. Immense amounts of data may be needed to train a classifier for good accuracy. The state-of-the-art classifiers [21, 25] need an in-memory data structure of size O(N), where N is the size of the training data, to achieve efficiency. For large data sets, such a data structure will not fit in internal memory. The best previously known classifier does a quadratic number of I/Os for large N. In this paper, we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm. We built a prototype of MIND in the..