19 research outputs found

    A multithreaded hybrid framework for mining frequent itemsets

    Mining frequent itemsets is an area of data mining that has attracted considerable research attention in recent years. Varied data structures such as Nodesets, DiffNodesets, NegNodesets, N-lists, and Diffsets have been employed to extract frequent items. However, most of these approaches fall short in either run time or memory usage. Hybrid frameworks, which combine two or more data structures to facilitate effective mining of frequent itemsets, were formulated to address these issues. Such an approach aims to exploit the advantages of each data structure while mitigating the drawbacks of relying on any one of them alone. However, limited effort has been made to reinforce the efficiency of such frameworks. To address these issues, this paper proposes a novel multithreaded hybrid framework comprising NegNodesets and the N-list structure that exploits the multicore capability of today's processors. While NegNodesets offer a concise representation of itemsets, N-lists rely on list intersection, thereby speeding up the mining process. To optimize the extraction of frequent items, a hash-based algorithm is designed to extract the resultant set of frequent items, which further enhances the novelty of the framework.
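    For readers unfamiliar with the general pattern, the sketch below illustrates hash-based support counting with transactions partitioned across worker threads and the per-partition hash tables merged at the end. It is a minimal illustration under assumed names, not the framework proposed in the paper (and CPython's GIL limits the speedup of a thread-based variant; it only shows the partition-count-merge structure).

```python
# Minimal sketch: hash-based support counting for candidate k-itemsets,
# with transactions partitioned across worker threads and per-partition
# hash tables merged at the end. Illustrative only; names are hypothetical.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def count_partition(transactions, k):
    """Count k-itemset occurrences in one partition using a hash table."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1
    return counts

def frequent_itemsets(transactions, k=2, min_support=2, n_workers=4):
    # Split the transaction database into roughly equal partitions.
    chunks = [transactions[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(count_partition, chunks, [k] * n_workers)
    # Merge the per-partition hash tables and filter by minimum support.
    total = Counter()
    for c in partials:
        total.update(c)
    return {iset: s for iset, s in total.items() if s >= min_support}

if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    print(frequent_itemsets(db, k=2, min_support=3))
```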

    Mining and ranking closed itemsets in large-scale transactional data

    The recent increase of data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of item sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands of items. These datasets usually follow a "long tail" distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (usually the most frequent ones). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, and less than a hundred associations at once. To address the items' coverage challenge, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI efficiently computes interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed up computation on large-scale datasets. Our second contribution is CAPA, a framework allowing us to study which existing measures of association rules' quality are relevant to rank results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemset mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain. Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets. Altogether, our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of items' associations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.
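    To make the item-centric semantics concrete, the following brute-force sketch computes, for each item in a tiny dataset, its k most frequent closed itemsets. It is an exhaustive reference illustration of the problem statement, not TopPI.

```python
# Brute-force sketch of item-centric mining: for every item, report the k most
# frequent *closed* itemsets containing it. Exhaustive and only usable on tiny
# datasets; it serves to make the problem definition concrete, not as TopPI.
from collections import defaultdict
from itertools import combinations

def closed_itemsets(transactions):
    """Return {itemset: support} for all closed itemsets (exhaustive search)."""
    support = defaultdict(int)
    for t in transactions:
        items = sorted(t)
        for r in range(1, len(items) + 1):
            for iset in combinations(items, r):
                support[frozenset(iset)] += 1
    # An itemset is closed if no proper superset has the same support.
    return {iset: sup for iset, sup in support.items()
            if not any(iset < other and support[other] == sup for other in support)}

def item_centric_topk(transactions, k=2):
    closed = closed_itemsets(transactions)
    per_item = defaultdict(list)
    for iset, sup in closed.items():
        for item in iset:
            per_item[item].append((sup, iset))
    # Keep only the k most frequent closed itemsets per item.
    return {item: sorted(v, key=lambda x: -x[0])[:k] for item, v in per_item.items()}

if __name__ == "__main__":
    db = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c", "d"}]
    for item, top in sorted(item_centric_topk(db, k=2).items()):
        print(item, top)
```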

    High Performance Frequent Subgraph Mining on Transactional Datasets

    Graph data mining has been a crucial and inevitable area of research. Large amounts of graph data are produced in many areas, such as bioinformatics, cheminformatics, social networks, and the Web. Scalable graph data mining methods are becoming increasingly popular and necessary as graph complexity grows. Frequent subgraph mining (FSM) is one such area, where the task is to find overly recurring patterns/subgraphs. To tackle this problem, many main-memory-based methods were proposed, which proved inefficient as data sizes grew exponentially over time. In the past few years, several research groups have attempted to handle the FSM problem in multiple ways. Many authors have tried to achieve better performance using Graphics Processing Units (GPUs), which offer multi-fold improvements over in-memory approaches when dealing with large datasets. Later, Google's MapReduce model with the Hadoop framework proved to be a major breakthrough in high-performance large-batch processing. Although MapReduce came with many benefits, its disk I/O and non-iterative model could not help much in the FSM domain, since subgraph mining is an inherently iterative process. In recent years, Spark has emerged as the de facto industry standard with its distributed in-memory computing capability, which is also a good fit for iterative workloads. In this work, we cover how high-performance computing has tremendously improved performance on transactional directed and undirected graphs, and we compare various FSM techniques based on experimental results.
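    The seed step that most transactional FSM algorithms iterate from is counting the support of single-edge patterns across the graph database; a minimal sketch of that step is shown below. The graph encoding and names are illustrative assumptions, not taken from any of the systems discussed in the work.

```python
# Minimal sketch of the first step of transactional frequent subgraph mining:
# counting the support of single-edge patterns (labelled edges) across a
# database of graphs. Full FSM algorithms grow these seeds iteratively, which
# is why iterative, in-memory platforms suit the problem better than plain
# MapReduce. The encoding of a graph as a set of labelled edges is illustrative.
from collections import Counter

def edge_pattern_support(graph_db):
    """graph_db: list of graphs, each a set of (label_u, edge_label, label_v) edges.
    Support of a pattern = number of graphs containing it at least once."""
    support = Counter()
    for g in graph_db:
        # Canonicalise each labelled edge so (A, x, B) and (B, x, A) match.
        patterns = {(min(u, v), e, max(u, v)) for (u, e, v) in g}
        support.update(patterns)          # count each pattern once per graph
    return support

def frequent_edges(graph_db, min_support):
    return {p: s for p, s in edge_pattern_support(graph_db).items() if s >= min_support}

if __name__ == "__main__":
    db = [
        {("C", "single", "O"), ("C", "double", "O")},
        {("C", "single", "O"), ("C", "single", "N")},
        {("C", "double", "O"), ("C", "single", "N")},
    ]
    print(frequent_edges(db, min_support=2))
```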

    Exploring Decomposition for Solving Pattern Mining Problems

    This article introduces a highly efficient pattern mining technique called Clustering-based Pattern Mining (CBPM). This technique discovers relevant patterns by studying the correlation between transactions in the transaction database using clustering techniques. The set of transactions is first clustered, such that highly correlated transactions are grouped together. Next, we derive the relevant patterns by applying a pattern mining algorithm to each cluster. We present two different pattern mining algorithms, one applying an approximation-based strategy and another based on an exact strategy. The approximation-based strategy takes into account only the clusters, whereas the exact strategy takes into account both the clusters and the items shared between clusters. To boost the performance of CBPM, a GPU-based implementation is investigated. To evaluate the CBPM framework, we perform extensive experiments on several pattern mining problems. The results show that CBPM reduces both runtime and memory usage. Moreover, CBPM based on the approximate strategy provides good accuracy, demonstrating its effectiveness and feasibility. Our GPU implementation achieves a significant speedup of up to 552× on a single GPU using big transaction databases.
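    To make the clustering-then-mining idea concrete, here is a toy sketch of the approximation-based strategy under simplifying assumptions: a naive anchor-item clustering and exhaustive per-cluster counting stand in for the actual CBPM components, whose details are not specified here.

```python
# Illustrative sketch of the clustering-then-mining idea (approximate strategy):
# transactions are grouped by a simple similarity heuristic, then frequent
# itemsets are mined independently inside each cluster and unioned. A toy
# stand-in for CBPM; the clustering and mining steps are deliberately naive.
from collections import Counter, defaultdict
from itertools import combinations

def cluster_by_anchor_item(transactions):
    """Naive clustering: group transactions by their lexicographically smallest item."""
    clusters = defaultdict(list)
    for t in transactions:
        clusters[min(t)].append(t)
    return list(clusters.values())

def mine_cluster(transactions, min_support):
    counts = Counter()
    for t in transactions:
        for r in range(1, len(t) + 1):
            for iset in combinations(sorted(t), r):
                counts[iset] += 1
    return {iset: s for iset, s in counts.items() if s >= min_support}

def cbpm_approximate(transactions, min_support):
    patterns = {}
    for cluster in cluster_by_anchor_item(transactions):
        # Approximate strategy: each cluster is mined in isolation, so patterns
        # whose support is split across clusters may be missed.
        patterns.update(mine_cluster(cluster, min_support))
    return patterns

if __name__ == "__main__":
    db = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"b", "d"}, {"a", "c"}]
    print(cbpm_approximate(db, min_support=2))
```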

    Distributed frequent hierarchical pattern mining for robust and efficient large-scale association discovery

    Field of study: Computer Science. Dissertation supervisor: Dr. Chi-Ren Shyu. May 2017. Frequent pattern mining is a classic data mining technique, generally applicable to a wide range of application domains, and a mature area of research. The fundamental challenge arises from the combinatorial nature of frequent itemsets, which scale exponentially with respect to the number of unique items. Apriori-based and FP-Tree-based algorithms have dominated the space thus far. Initial phases of this research relied on the Apriori algorithm and utilized a distributed computing environment; we proposed the Cartesian Scheduler to manage Apriori's candidate generation process. To address the limitations of bottom-up frequent pattern mining algorithms such as Apriori and FP-Growth, we propose the Frequent Hierarchical Pattern Tree (FHPTree): a tree structure and a new frequent pattern mining paradigm. The classic problem is redefined as frequent hierarchical pattern mining, where the goal is to detect frequent maximal pattern covers. Under the proposed paradigm, compressed representations of maximal patterns are mined using a top-down FHPTree traversal, FHPGrowth, which detects large patterns before their subsets, thus yielding significant reductions in computation time. The FHPTree memory footprint is small; the number of nodes in the structure scales linearly with respect to the number of unique items. Additionally, the FHPTree serves as a persistent, dynamic data structure to index frequent patterns and enable efficient searches. When the search space is exponential, efficient targeted mining capabilities are paramount; this is one of the key contributions of the FHPTree. This dissertation demonstrates the performance of FHPGrowth, achieving a 300x speedup over state-of-the-art maximal pattern mining algorithms and approximately a 2400x speedup when utilizing FHPGrowth in a distributed computing environment. In addition, we outline future research opportunities and suggest various modifications to further optimize the FHPTree and FHPGrowth. Moreover, the methods we offer will have an impact on other data mining research areas, including contrast set mining as well as spatial and temporal mining.
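    A minimal sketch of the top-down intuition (examining large candidates before their subsets and pruning anything already covered by a maximal pattern found earlier) is given below. It is a brute-force illustration of the traversal order only, not the FHPTree data structure or FHPGrowth itself.

```python
# Sketch of the top-down idea: enumerate candidate itemsets from largest to
# smallest, skip any candidate covered by a maximal frequent pattern already
# found, and record frequent candidates as maximal. Brute-force and only for
# tiny inputs; it is not the FHPTree/FHPGrowth described in the dissertation.
from itertools import combinations

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def maximal_frequent_topdown(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    maximal = []
    # Enumerate candidates from the largest size down to single items.
    for size in range(len(items), 0, -1):
        for cand in combinations(items, size):
            cand = frozenset(cand)
            # Prune: a subset of an already-found maximal pattern is redundant.
            if any(cand <= m for m in maximal):
                continue
            if support(cand, transactions) >= min_support:
                maximal.append(cand)   # found before any of its subsets
    return maximal

if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
    print(maximal_frequent_topdown(db, min_support=2))
```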

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and it remains a challenging problem. While previous works have studied specific aspects of ER (mostly in traditional settings), in this survey we provide, for the first time, an end-to-end view of modern ER workflows and of the novel entity indexing and matching methods that cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of the entity descriptions used by real-world applications. Finally, we provide a summary discussion of the existing approaches and conclude with a detailed presentation of open research directions.
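    As a concrete illustration of the blocking-then-matching skeleton that such workflows share, the following sketch resolves a toy set of records. The blocking key, similarity function, and threshold are illustrative assumptions, not recommendations taken from the survey.

```python
# Minimal end-to-end entity resolution sketch: (1) blocking/indexing so that
# only records sharing a key are compared, then (2) pairwise matching inside
# each block with a simple similarity threshold. All names are illustrative.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def blocking_key(record):
    """Index records by the first three letters of the (lowercased) name."""
    return record["name"].lower()[:3]

def similarity(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def resolve(records, threshold=0.85):
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    matches = []
    for block in blocks.values():
        # Only records sharing a block key are ever compared.
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches

if __name__ == "__main__":
    records = [
        {"id": 1, "name": "Jonathan Smith"},
        {"id": 2, "name": "Jonathon Smith"},
        {"id": 3, "name": "Alice Jones"},
    ]
    print(resolve(records))   # expected: [(1, 2)]
```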

    Distributed pattern mining and data publication in life sciences using big data technologies
