13 research outputs found

    BAC: A bagged associative classifier for big data frameworks

    Big Data frameworks enable powerful distributed computations that extend the results achievable on a single machine. In this work, we present a novel distributed associative classifier, named BAC, based on ensemble techniques. Ensembles are a popular approach that builds several models on different subsets of the original dataset and then combines their votes into a single classification outcome. Preliminary experiments on Apache Spark showed that the proposed ensemble classifier achieves quality comparable to that of the single-machine version on popular real-world datasets and overcomes its scalability limits on large synthetic datasets.
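The bagging-and-voting idea in this abstract can be sketched roughly as follows. This is a minimal single-machine illustration, not the BAC implementation: the base learner `train_toy_model` is a hypothetical stand-in for an associative classifier (it maps each feature/value item to its most frequent class), and the function names, `n_models`, and `seed` are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def train_toy_model(rows):
    """Hypothetical base learner: map each (feature, value) item to its most frequent class."""
    item_counts = defaultdict(Counter)
    label_counts = Counter()
    for features, label in rows:
        label_counts[label] += 1
        for item in features.items():
            item_counts[item][label] += 1
    default = label_counts.most_common(1)[0][0]
    rules = {item: c.most_common(1)[0][0] for item, c in item_counts.items()}
    return rules, default

def predict_one(model, features):
    """Let every matching one-item rule vote; fall back to the default class."""
    rules, default = model
    votes = Counter(rules[item] for item in features.items() if item in rules)
    return votes.most_common(1)[0][0] if votes else default

def train_bagged(rows, n_models=5, seed=0):
    """Train one model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]
        models.append(train_toy_model(sample))
    return models

def predict_bagged(models, features):
    """Majority vote across the ensemble."""
    votes = Counter(predict_one(m, features) for m in models)
    return votes.most_common(1)[0][0]
```

In a distributed setting each bootstrap sample would typically live on a different worker, so the models can be trained in parallel and only the votes need to be combined.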

    I-prune: Item selection for associative classification

    Associative classification is characterized by accurate models and high model generation time. Most of the time is spent extracting and postprocessing a large set of irrelevant rules, which are eventually pruned. We propose I-prune, an item-pruning approach that selects uninteresting items by means of an interestingness measure and prunes them as soon as they are detected. Thus, the number of extracted rules is reduced and model generation time decreases correspondingly. A wide set of experiments on real and synthetic data sets has been performed to evaluate I-prune and select the appropriate interestingness measure. The experimental results show that I-prune allows a significant reduction in model generation time, while increasing (or at worst preserving) model accuracy. Experimental evaluation also points to the chi-square measure as the most effective interestingness measure for item pruning.
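To make the item-pruning idea concrete, here is a hedged sketch that scores each item with a chi-square statistic against the class label and drops low-scoring items before any rule mining. The abstract does not say how I-prune applies the measure, so the threshold of 3.84 (the usual 5% critical value for one degree of freedom, appropriate for two classes) and the function names `chi_square` and `prune_items` are assumptions for illustration only.

```python
from collections import Counter

def chi_square(item, transactions):
    """Chi-square statistic of the 2 x k table: item present/absent vs. class label."""
    n = len(transactions)
    class_totals = Counter(label for _, label in transactions)
    with_item = Counter(label for items, label in transactions if item in items)
    n_item = sum(with_item.values())
    stat = 0.0
    for label, n_class in class_totals.items():
        # One cell for "item present", one for "item absent".
        for observed, row_total in (
            (with_item[label], n_item),
            (n_class - with_item[label], n - n_item),
        ):
            expected = row_total * n_class / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def prune_items(transactions, threshold=3.84):
    """Remove items whose chi-square score falls below the (assumed) threshold."""
    all_items = set().union(*(items for items, _ in transactions))
    keep = {i for i in all_items if chi_square(i, transactions) >= threshold}
    return [(items & keep, label) for items, label in transactions]
```

An item that appears in every transaction, regardless of class, scores zero and is removed, so it can never contribute to a candidate rule.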

    A review of associative classification mining

    Associative classification mining is a promising approach in data mining that uses association rule discovery techniques to construct classification systems, also known as associative classifiers. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. These algorithms employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. This paper surveys and compares the state-of-the-art associative classification techniques with regard to the above criteria. Finally, future directions in associative classification, such as incremental learning and mining low-quality data sets, are also highlighted.

    A MapReduce solution for associative classification of big data

    Associative classifiers have proven to be very effective in classification problems. Unfortunately, the algorithms used for learning these classifiers are not able to adequately manage big data because of time complexity and memory constraints. To overcome such drawbacks, we propose a distributed association rule-based classification scheme shaped according to the MapReduce programming model. The scheme mines classification association rules (CARs) using a properly enhanced, distributed version of the well-known FP-Growth algorithm. Once CARs have been mined, the proposed scheme performs a distributed rule pruning. The set of surviving CARs is used to classify unlabeled patterns. The memory usage and time complexity for each phase of the learning process are discussed, and the scheme is evaluated on seven real-world big datasets on the Hadoop framework, characterizing its scalability and achievable speedup on small computer clusters. The proposed solution for associative classifiers turns out to be suitable for practically addressing big datasets even with modest hardware support. Comparisons with two state-of-the-art distributed learning algorithms are also discussed in terms of accuracy, model complexity, and computation time.
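The map and reduce phases of CAR mining can be illustrated with a deliberately simplified sketch. It is not the distributed FP-Growth scheme the abstract describes: it only counts single-item rule bodies, simulates the partitions in one process, and uses hypothetical names (`map_partition`, `reduce_counts`, `mine_cars`) and assumed thresholds.

```python
from collections import Counter
from functools import reduce

def map_partition(partition):
    """Map step: count (item, label) pairs and per-item totals in one data partition."""
    counts = Counter()
    for items, label in partition:
        for item in items:
            counts[(item, label)] += 1  # rule body together with the class
            counts[(item, None)] += 1   # rule body alone (assumes None is never a label)
    return counts

def reduce_counts(a, b):
    """Reduce step: merge the partial counts of two partitions."""
    return a + b

def mine_cars(partitions, n_rows, min_support=0.2, min_confidence=0.8):
    """Keep one-item rules 'item -> label' that pass both thresholds."""
    total = reduce(reduce_counts, map(map_partition, partitions), Counter())
    rules = []
    for (item, label), count in total.items():
        if label is None:
            continue
        support = count / n_rows
        confidence = count / total[(item, None)]
        if support >= min_support and confidence >= min_confidence:
            rules.append((item, label, support, confidence))
    # Rank by confidence, then support, then item name for a stable order.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))
```

Because the partial counts are merged by simple addition, the map step can run independently on each partition, which is the property a MapReduce formulation relies on.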

    Fuzzy-Granular Based Data Mining for Effective Decision Support in Biomedical Applications

    Due to the complexity of biomedical problems, adaptive and intelligent knowledge discovery and data mining systems are highly needed to help humans understand the inherent mechanisms of diseases. For biomedical classification problems, it is typically impossible to build a perfect classifier with 100% prediction accuracy. Hence, a more realistic target is to build an effective Decision Support System (DSS). In this dissertation, a novel adaptive Fuzzy Association Rules (FARs) mining algorithm, named FARM-DS, is proposed to build such a DSS for binary classification problems in the biomedical domain. Empirical studies show that FARM-DS is competitive with state-of-the-art classifiers in terms of prediction accuracy. More importantly, FARs can provide strong decision support for disease diagnosis due to their easy interpretability. This dissertation also proposes a fuzzy-granular method to select informative and discriminative genes from huge microarray gene expression data. With fuzzy granulation, information loss in the process of gene selection is decreased. As a result, more informative genes for cancer classification are selected and more accurate classifiers can be modeled. Empirical studies show that the proposed method is more accurate than traditional algorithms for cancer classification. Hence, we expect that the selected genes can be more helpful for further biological studies.

    Nuevos retos en clasificación asociativa: Big Data y aplicaciones

    Associative classification arises from the union of two important areas of machine learning. It combines the descriptive task of association rule mining, a mechanism for obtaining previously unknown and interesting information from a dataset, with the predictive task of classification, which uses a set of known explanatory variables to predict a variable of interest. The objectives of this doctoral thesis are the following: 1) to study and analyze the state of the art of both association rule mining and associative classification; 2) to propose new models for associative classification and association rule mining that are accurate, interpretable, efficient, and flexible enough to incorporate subjective knowledge; 3) additionally, given the large amount of data generated every day in recent decades, to pay special attention to the processing of large quantities of data, also known as Big Data. First, the state of the art of both associative classification and association rule mining was analyzed through an exhaustive study of the related literature. This laid the groundwork for the remaining objectives and revealed that associative classification needed a mechanism to unify comparisons and make them as complete as possible. To that end, a software tool was proposed that includes at least one algorithm from every category of the current taxonomy. The tool will allow researchers in the area to make more diverse and complete comparisons, which until now was a very arduous task at best, since many of the algorithms were not available in executable form, much less as open source. The tool also offers a very diverse set of metrics for quantifying the quality of the results from different perspectives, which helps to obtain classifiers that are as complete as possible and to unify future comparisons with other proposals. Second, the preceding analysis showed that current proposals do not allow the methodologies to scale, either horizontally or vertically, to relatively large datasets. Given the growing interest, in both academia and industry, in extending computing capacity to huge amounts of data, the thesis continued with an analysis of different proposals for Big Data. It began with a detailed analysis of the latest advances in processing such quantities of data. Special attention was paid to distributed computing, since it has proven to be the only procedure that allows large quantities of data to be processed without resorting to sampling techniques. In particular, the focus was on methodologies based on MapReduce, which decomposes complex problems into divisible, parallelizable fractions that can later be grouped to obtain the final result. As a result of this objective, different algorithms were proposed that can process large quantities of data without loss of accuracy or interpretability. All the proposed algorithms were designed to run on the best-known open-source implementations of MapReduce. Third and last, a proposal was made to improve the state of the art of associative classification. Since association rules are the basis of, and the determining factors for, associative classifiers, the work began with a new proposal for association rule mining. It combines the latest advances in distributed computing, such as MapReduce, with evolutionary algorithms, which have been shown to obtain excellent results in the area. In particular, grammar-guided genetic programming was used for its flexibility in encoding solutions and in introducing subjective knowledge into the search process, while also easing computational and memory requirements. This new algorithm represents a significant improvement in association rule mining, since it has been shown to obtain better results than existing proposals on different types of data and on different metrics of interest; that is, it not only obtains better results on Big Data, but its sequential version has also been compared with existing algorithms. Once this algorithm for extracting excellent association rules was available, it was adapted to obtain class association rules and to build a classifier from those rules. Again, grammar-guided genetic programming was used to obtain the classifier, so that the user can introduce subjective knowledge not only into the form of the rules but also into the final form of the classifier. This new proposal was also compared, in sequential form, with existing associative classification algorithms to ensure that it achieves significant differences with respect to them in terms of accuracy, interpretability, and efficiency. It was additionally compared with other proposals specific to Big Data, showing excellent results while maintaining a trade-off among the conflicting objectives of interpretability, accuracy, and efficiency. This doctoral thesis was developed under an appropriate experimental framework, using diverse datasets that include both low-dimensional data and Big Data. All the datasets used are freely published and cover a variety of dimensionalities, numbers of instances, and numbers of classes. All results were compared with the corresponding state of the art, and non-parametric statistical tests were used to verify that the differences found are statistically significant and not due to chance. Additionally, all comparisons consider different perspectives; that is, performance, efficiency, accuracy, and interpretability were analyzed in each study.
    This Doctoral Thesis aims at solving the challenging problem of associative classification and its application to very large datasets. First, the state of the art in associative classification has been studied and analyzed, and a new tool covering the whole taxonomy of algorithms and providing many different measures has been proposed. The goal of this tool is two-fold: 1) unifying comparisons, since existing works compare using very different measures; 2) providing a single tool that has at least one algorithm from each category of the taxonomy. This tool is a very important advance in the field, since until now the whole taxonomy had not been covered because many algorithms had neither been released as open source nor been available to run. Second, associative classification has been analyzed on very large quantities of data. In this regard, many different platforms for distributed computing have been studied and different proposals have been developed on them. These proposals make it possible to deal with very large data efficiently by distributing the load across different compute nodes. Third, since one of the most important parts of associative classification is extracting high-quality rules, a novel grammar-guided genetic programming algorithm has been proposed that obtains interesting association rules with regard to different metrics and on different kinds of data, including truly Big Data datasets. This proposal has been shown to obtain very good results in terms of both quality and interpretability, while providing a very flexible way of representing the solutions and allowing subjective knowledge to be introduced into the search process. Then, a novel algorithm has been proposed for associative classification, using a non-trivial adaptation of the aforementioned algorithm to obtain the rules forming the classifier. This methodology is also based on grammar-guided genetic programming, enabling the user to constrain not only the form of the rules but also the final form of the classifier. Results have shown that this algorithm obtains very accurate classifiers while maintaining a good level of interpretability. All the methodologies proposed in this Thesis have been evaluated within a proper experimental framework, using a varied set of datasets including both classical and Big Data datasets, and analyzing different metrics to quantify the quality of the algorithms from different perspectives. Results have been compared with the state of the art and verified by means of non-parametric statistical tests, which show that the proposed methods outperform existing approaches.

    Classification algorithms for Big Data with applications in the urban security domain

    A classification algorithm is a versatile tool that can serve as a predictor for the future or as an analytical tool to understand the past. Several obstacles prevent classification from scaling to a large Volume, Velocity, Variety or Value. The aim of this thesis is to scale distributed classification algorithms beyond current limits, assess the state of practice of Big Data machine learning frameworks, and validate the effectiveness of a data science process in improving urban safety. We found that massive datasets with a number of large-domain categorical features pose a difficult challenge for existing classification algorithms. We propose associative classification as a possible answer, and develop several novel techniques to distribute the training of an associative classifier among parallel workers and improve the final quality of the model. The experiments, run on a real large-scale dataset with more than 4 billion records, confirmed the quality of the approach. To assess the state of practice of Big Data machine learning frameworks and streamline the integration and fine-tuning of the building blocks, we developed a generic, self-tuning tool to extract knowledge from network traffic measurements. The result is a system that offers human-readable models of the data with minimal user intervention, validated by experiments on large collections of real-world passive network measurements. A good portion of this dissertation is dedicated to the study of a data science process to improve urban safety. First, we shed some light on the feasibility of a system to monitor social messages from a city for emergency relief. We then propose a methodology to mine temporal patterns in social issues, like crimes. Finally, we propose a system to integrate the findings of Data Science on the citizenry’s perception of safety and communicate its results to decision makers in a timely manner. We applied and tested the system in a real Smart City scenario, set in Turin, Italy.

    LC an effective classification based association rule mining algorithm

    Classification using association rules is a research field in data mining that primarily applies association rule discovery techniques to classification benchmarks. Many research studies in the literature have confirmed that classification using association tends to generate more predictive classification systems than traditional classification data mining techniques such as probabilistic, statistical and decision tree approaches. In this thesis, we introduce a novel data mining algorithm based on classification using association called “Looking at the Class” (LC), which can be used for mining a range of classification data sets. Unlike known algorithms in the classification using association approach, such as the Classification based on Association rules (CBA) system and the Classification based on Predictive Association Rules (CPAR) system, which merge disjoint items in the rule learning step without anticipating the class label similarity, the proposed algorithm merges only items with identical class labels. This avoids many unnecessary item combinations during the rule learning step, and consequently results in large savings in computational time and memory. Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules to make the prediction decision instead of a single rule. The proposed algorithm has been evaluated thoroughly on real-world security data sets collected using an automated tool developed at Huddersfield University. The security application considered in this thesis is categorizing websites as legitimate or fake based on their features, which is a typical binary classification problem. Experiments on a number of UCI data sets have also been conducted, and the measures used for evaluation are classification accuracy, memory usage, and others. The results show that the LC algorithm outperformed traditional classification algorithms such as C4.5, PART and Naïve Bayes, as well as known classification based association algorithms like CBA, with respect to classification accuracy, memory usage, and execution time on most of the data sets we consider.
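The difference between single-rule and multiple-rule prediction mentioned in this abstract can be illustrated with a small sketch. The abstract does not describe how LC combines rules, so summing the confidences of all matching rules per class is an assumption, and the rule format `(body, label, confidence)`, the function names, and the website features are hypothetical.

```python
from collections import defaultdict

def predict_single(rules, instance, default):
    """Use only the highest-confidence rule whose body is contained in the instance."""
    for body, label, confidence in sorted(rules, key=lambda r: -r[2]):
        if body <= instance:
            return label
    return default

def predict_multi(rules, instance, default):
    """Sum the confidence of all matching rules per class and pick the largest total."""
    scores = defaultdict(float)
    for body, label, confidence in rules:
        if body <= instance:
            scores[label] += confidence
    return max(scores, key=scores.get) if scores else default
```

With one strong rule for one class and several weaker rules for another, the two procedures can disagree, which is the situation a multiple-rule scheme is meant to handle.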