65 research outputs found

    Towards a semantic and statistical selection of association rules

    The rapid growth of databases creates an urgent need for more accurate methods to understand the stored data. Association rules have been used extensively to analyze and comprehend huge amounts of data. However, the number of generated rules is too large to be analyzed and explored efficiently in any further process. Association rule selection is a classical way to address this issue, yet innovative approaches are still required to help decision makers. Many interestingness measures have therefore been defined to statistically evaluate and filter association rules. These measures present two major problems: on the one hand, they do not allow irrelevant rules to be eliminated; on the other hand, their abundance leads to heterogeneous evaluation results, which causes confusion in decision making. In this paper, we propose a two-winged approach to select statistically interesting and semantically incomparable rules. Our statistical selection helps discover interesting association rules without favoring or excluding any measure. The semantic comparability helps decide whether the considered association rules are semantically related, i.e., comparable. Experiments on real datasets show promising results in terms of reduction in the number of rules.
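
    As a point of reference for the interestingness measures mentioned above, the following minimal sketch computes three common ones (support, confidence, lift) for a single rule. The toy transactions and the function names `support` and `rule_measures` are assumptions made for illustration; the abstract does not say which measures the paper uses.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_measures(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    s_xy = support(antecedent | consequent, transactions)
    s_x = support(antecedent, transactions)
    s_y = support(consequent, transactions)
    confidence = s_xy / s_x if s_x else 0.0
    lift = confidence / s_y if s_y else 0.0
    return {"support": s_xy, "confidence": confidence, "lift": lift}


# Hypothetical market-basket data, not taken from the paper.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

measures = rule_measures(frozenset({"bread"}), frozenset({"milk"}), transactions)
print(measures)
```

    Because each measure can rank the same rules differently, filtering on any single one tends to favor that measure, which is the heterogeneity problem the abstract describes.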

    Contributions to Multidimensional Query Optimization (Contributions à l’Optimisation de Requêtes Multidimensionnelles)

    Analyzing data consists of choosing a subset of the dimensions that describe it in order to extract useful information. However, the "interesting" dimensions are rarely known a priori. Analysis then becomes an exploratory activity in which each pass translates into a query. It therefore becomes essential to propose query optimization solutions that take a global view of the process rather than seeking to optimize each query independently of the others. We present our contributions within this exploratory approach, focusing on three types of queries: (i) the computation of borders, (ii) OLAP (On Line Analytical Processing) queries in data cubes, and (iii) skyline-type preference queries.

    Hybrid ASP-based Approach to Pattern Mining

    Detecting small sets of relevant patterns from a given dataset is a central challenge in data mining. The relevance of a pattern is based on user-provided criteria; typically, all patterns that satisfy certain criteria are considered relevant. Rule-based languages like Answer Set Programming (ASP) seem well-suited for specifying such criteria in the form of constraints. Although progress has been made, on the one hand, on solving individual mining problems and, on the other hand, on developing generic mining systems, the existing methods focus either on scalability or on generality. In this paper we make steps towards combining local (frequency, size, cost) and global (various condensed representations like maximal, closed, skyline) constraints in a generic and efficient way. We present a hybrid approach for itemset, sequence and graph mining which exploits dedicated, highly optimized mining systems to detect frequent patterns and then filters the results using declarative ASP. To further demonstrate the generic nature of our hybrid framework, we apply it to the problem of approximately tiling a database. Experiments on real-world datasets show the effectiveness of the proposed method and computational gains for itemset, sequence and graph mining, as well as approximate tiling. Under consideration in Theory and Practice of Logic Programming (TPLP). Comment: 29 pages, 7 figures, 5 tables
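
    The two-step "mine, then filter" idea can be sketched as follows. This is only an illustration under stated assumptions: a brute-force enumeration stands in for the dedicated mining system, and a plain Python predicate stands in for the declarative ASP filter. The names `frequent_itemsets` and `keep_maximal` and the toy data are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Step 1 (local constraint): enumerate itemsets occurring in at
    least `min_count` transactions. Brute force, for illustration only."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count:
                result[candidate] = count
    return result


def keep_maximal(frequent):
    """Step 2 (global constraint): keep only itemsets that have no
    frequent proper superset."""
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(itemset < other for other in frequent)
    }


# Hypothetical transactions, not taken from the paper.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

frequent = frequent_itemsets(transactions, min_count=3)
maximal = keep_maximal(frequent)
print(sorted(sorted(s) for s in maximal))
```

    Separating the steps means the second stage only sees the already-frequent patterns, so the global check runs over a much smaller set than the full search space.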

    Ranking and selecting association rules based on dominance relationship

    The huge number of association rules is the main obstacle that a decision maker faces. To bypass this obstacle, an efficient selection of rules must be performed. Since selection is necessarily based on evaluation, many interestingness measures have been proposed. However, the abundance of these measures has caused a new problem: the evaluation results are heterogeneous, which confuses the decision. In this scope, we propose a novel approach to discover interesting association rules without favouring or excluding any measure by adopting the notion of dominance between rules. Our approach bypasses the problem of measure heterogeneity by finding a compromise between the measures' evaluations, and it also bypasses another non-trivial problem, namely the specification of threshold values.
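
    One common reading of dominance between rules is Pareto dominance over their measure vectors: a rule dominates another if it is at least as good on every measure and strictly better on at least one. The sketch below selects the undominated rules under that assumption. The rule names, the measure values, and the choice of (support, confidence, lift) are hypothetical, and the paper's ranking step is not reproduced.

```python
def dominates(u, v):
    """True if measure vector u is >= v on every measure and > on one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def undominated(rules):
    """Names of the rules whose measure vector no other rule dominates."""
    return [
        name
        for name, vec in rules.items()
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in rules.items()
            if other_name != name
        )
    ]


# Hypothetical (support, confidence, lift) values for four rules.
rules = {
    "r1": (0.50, 0.80, 1.2),
    "r2": (0.40, 0.70, 1.1),  # worse than r1 on every measure
    "r3": (0.30, 0.90, 1.0),  # best confidence
    "r4": (0.20, 0.60, 1.5),  # best lift
}

selected = undominated(rules)
print(selected)
```

    No threshold appears anywhere in the selection: a rule is kept or dropped purely by comparison with the other rules.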

    Constraint Programming for Multi-criteria Conceptual Clustering

    A conceptual clustering is a set of formal concepts (i.e., closed itemsets) that defines a partition of a set of transactions. Finding a conceptual clustering is an NP-complete problem for which Constraint Programming (CP) and Integer Linear Programming (ILP) approaches have been recently proposed. We introduce new CP models to solve this problem: a pure CP model that uses set constraints, and a hybrid model that uses a data mining tool to extract formal concepts in a preprocessing step and then uses CP to select a subset of formal concepts that defines a partition. We compare our new models with recent CP and ILP approaches on classical machine learning instances. We also introduce a new set of instances coming from a real application case, which aims at extracting setting concepts from an Enterprise Resource Planning (ERP) software. We consider two classic criteria to optimize, i.e., the frequency and the size. We show that these criteria lead to extreme solutions with either very few small formal concepts or many large formal concepts, and that compromise clusterings may be obtained by computing the Pareto front of non-dominated clusterings.
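
    To make the definition concrete, the sketch below computes the closure of an itemset (the items shared by all transactions that contain it) and checks whether a set of concept extents partitions the transactions. It only illustrates the definitions. It is not one of the CP models from the paper, and the function names and toy data are assumptions.

```python
def extent(itemset, transactions):
    """Indices of the transactions that contain every item of `itemset`."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)


def closure(itemset, transactions):
    """Items common to all transactions covering `itemset`.
    An itemset is closed (a formal concept intent) if it equals its closure."""
    covered = [transactions[i] for i in extent(itemset, transactions)]
    if not covered:
        # Empty extent: by convention the closure is the full item set.
        return frozenset().union(*transactions)
    return frozenset.intersection(*covered)


def is_partition(extents, n_transactions):
    """True if every transaction index appears in exactly one extent."""
    seen = [i for e in extents for i in e]
    return sorted(seen) == list(range(n_transactions))


# Hypothetical transactions, not taken from the paper.
transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"d"}),
    frozenset({"d", "e"}),
]

concept_ab = closure(frozenset({"a"}), transactions)
concept_d = closure(frozenset({"d"}), transactions)
extents = [extent(concept_ab, transactions), extent(concept_d, transactions)]
print(concept_ab, concept_d, is_partition(extents, len(transactions)))
```

    Here `{"a"}` is not closed, because every transaction containing `a` also contains `b`, whereas `{"a", "b"}` and `{"d"}` are closed and their extents cover each transaction exactly once.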

    Mining Privacy-Preserving Association Rules based on Parallel Processing in Cloud Computing

    With the onset of the Information Era and the rapid growth of information technology, ample space for processing and extracting data has opened up. However, privacy concerns may stifle expansion in this area. This study addresses the challenge of reliable mining when transactions are dispersed across sources. It explores a new set of three algorithms intended to maximize privacy, data utility, and time savings. In the preparation phase, the paper proposes a double encryption and Transaction Splitter approach that alters the database to optimize the tradeoff between data utility and confidentiality. For the mining process, it presents a customized apriori approach that does not examine the entire database to estimate the support for each attribute. Existing distributed data solutions have a high encryption complexity and an insufficient specification of many participants' properties. The proposed solutions provide increased privacy protection against a variety of attack models. Furthermore, they are much simpler and quicker in terms of communication cycles and processing complexity. Tests of the proposed work on a real-world transaction database demonstrate that the aim of the proposed method is realistic.

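    Determining time-variant weighted, utility-based association rules by applying a frequent pattern tree (Determinação das regras de associação de variáveis de tempo ponderadas baseadas em utilidades mediante a aplicação de uma árvore de padrões frequentes)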

    Introduction: The present research was conducted at Birla Institute of Technology, off campus in Noida, India, in 2017. Methods: To assess the efficiency of the proposed approach for information mining, a method and an algorithm were proposed for mining time-variant weighted, utility-based association rules using an FP-tree. Results: A method is suggested to find association rules on time-oriented, frequency-weighted, utility-based data, employing a hierarchy for extracting itemsets and establishing their association. Conclusions: The dimensions adopted while developing the approach compressed a large time-variant dataset into a smaller data structure; at the same time, the FP-tree was kept away from the repetitive dataset, which gave a noteworthy advantage in terms of time and memory use. Originality: Currently, high-utility recurrent-pattern extraction is one of the most noteworthy study areas in time-variant information mining because of its capability to account for the frequency of itemsets and the varied utility rates of every itemset. This research contributes to keeping study of the topic at a corresponding level, which avoids generating a large number of candidate sets and thereby supports further reductions in execution time and search space. Limitations: The research results demonstrated that the proposed approach was efficient on tested datasets with pre-defined weight and utility calculations.

    Unified Framework for Data Mining using Frequent Model Tree

    Abstract: Data mining is the science of discovering hidden patterns from data. Over the past years, a plethora of data mining algorithms has been developed to carry out various data mining tasks such as classification, clustering, association mining and regression. All the methods are ad hoc in nature, and there exists no unifying framework which unites all the data mining tasks. This study proposes such a framework, which describes a data modelling technique to model data in a manner that can be used to accomplish all kinds of data mining tasks. This study proposes a novel algorithm known as Frequent Model (FM)-Growth, based on the Frequent Pattern (FP)-Growth algorithm. The algorithm is used to find frequent patterns or models from data. These models are then used to carry out various data mining tasks such as classification and clustering. The advantage of these frequent models is that they can be used as is with any data mining task, irrespective of the nature of the task. The algorithm is carried out in two stages. In the first stage, we grow the FM-tree from the data, and in the second stage, we extract the frequent models from the FM-tree. The accuracy of the proposed algorithm is high. However, the algorithm is computationally expensive when searching for frequent models in high-volume and high-dimensional data, because it needs to traverse all the nodes of a tree. The study suggests measures to improve the efficiency of the overall process using a dictionary data structure. Keywords: Data Mining, Frequent Pattern Recognition, Unified Framework, Classification, Clustering, FP-Growth tree

    Computing Closed Skycubes

    In this paper, we tackle the problem of efficient skycube computation. We introduce a novel approach that significantly reduces both the domination tests for a given subspace and the number of subspaces searched. Technically, we identify two types of skyline points that can be directly derived without using any domination tests. Moreover, based on formal concept analysis, we introduce two closure operators that enable a concise representation of skyline cubes. We show that this concise representation is easy to compute and develop an efficient algorithm, which only needs to search a small portion of the huge search space. We show with empirical results the merits of our approach.
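
    For orientation, a skycube holds the skyline of every non-empty subspace of the dimensions. The naive baseline below computes it with explicit domination tests in each subspace, which is exactly the cost the paper aims to reduce. It is not the closure-based algorithm from the paper. The sketch assumes smaller values are preferred on every dimension, and the points and function names are hypothetical.

```python
from itertools import combinations


def dominates(p, q, dims):
    """True if p is <= q on every dimension in `dims` and < on at least one
    (smaller is assumed to be better)."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


def skyline(points, dims):
    """Indices of the points not dominated by any other point on `dims`."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p, dims) for q in points)
    ]


def skycube(points, n_dims):
    """Skyline of every non-empty subspace, keyed by the tuple of dimensions."""
    cube = {}
    for size in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), size):
            cube[dims] = skyline(points, dims)
    return cube


# Hypothetical 2-dimensional points.
points = [(1, 4), (2, 2), (4, 1), (3, 3)]
cube = skycube(points, n_dims=2)
for dims, indices in cube.items():
    print(dims, indices)
```

    With d dimensions this baseline evaluates 2^d - 1 subspaces and runs a quadratic number of domination tests in each.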