14 research outputs found

    Comparison of different algorithms for exploting the hidden trends in data sources

    Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2003. Includes bibliographical references (leaves: 92-97). Text in English; abstract in Turkish and English. 97 leaves. The growth of large-scale transactional databases, time-series databases and other kinds of databases has given rise to several efficient algorithms that cope with the computationally expensive task of association rule mining. In this study, three algorithms, Apriori, FP-tree and CHARM, which exploit hidden trends in the form of frequent itemsets, frequent patterns and closed frequent itemsets respectively, were discussed and their performance was evaluated. The performance of the algorithms was measured at different support levels, and the algorithms were tested on both synthetic and real data sets. The algorithms were compared according to their data preparation performance, mining performance, run-time performance and knowledge extraction capabilities. The Apriori algorithm is the most prevalent association rule mining algorithm; it makes multiple passes over the database to find the set of frequent itemsets at each level. The FP-tree algorithm is a scalable algorithm that finds the complete set of prefix paths, conditional pattern bases and frequent patterns by using a compact FP-tree based mining method. CHARM is a novel algorithm that brings remarkable improvements over existing association rule mining algorithms by showing that mining the set of closed frequent itemsets is sufficient, instead of mining the set of all frequent itemsets. Based on our experimental results, we conclude that the Apriori algorithm performs well on sparse data sets. The FP-tree algorithm extracts fewer associations than Apriori; however, it is a completely feasible solution for mining dense data sets at low support levels. On the other hand, the CHARM algorithm is appropriate for mining closed frequent itemsets (a substantial portion of frequent itemsets) on both sparse and dense data sets, even at low levels of support.
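    The level-wise idea behind Apriori and the notion of a closed frequent itemset can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `apriori` and `closed_only` and the toy data are assumptions made for illustration, and the closed itemsets are obtained here by filtering the Apriori output after the fact, whereas CHARM mines them directly.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items in one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k:
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # One more pass over the data to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

def closed_only(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        s: c for s, c in frequent.items()
        if not any(s < other and c == oc for other, oc in frequent.items())
    }

# Toy data (hypothetical): four transactions over items a, b, c.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
frequent = apriori(data, min_support=2)
closed = closed_only(frequent)
```

    On this toy input, five itemsets are frequent at a minimum support of 2, but only three of them are closed, which shows why the closed set can be a more compact summary.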

    An Efficient Load Balancing Multi-core Frequent Patterns Mining Algorithm

    Abstract: Mining frequent patterns from transactional databases is an important problem in data mining. Many methods have been proposed to solve this problem. However, the computation time still increases significantly as the data size grows. Therefore, parallel computing is a good strategy for this problem. Researchers have proposed various parallel and distributed algorithms on cluster and grid systems. However, the construction and maintenance costs of such systems are high. In this paper, a multi-core load-balancing frequent pattern mining algorithm is presented. The main goal of the proposed algorithm is to reduce the massive number of duplicated candidates generated by previous methods. To verify its performance, we implemented the proposed algorithm as well as previous methods for comparison. The experimental results showed that our method could reduce the computation time dramatically with more threads. Moreover, we observed that the workload was dispatched equally to each computing unit.

    Annales Mathematicae et Informaticae (48.)


    Mining localized co-expressed gene patterns from microarray data

    Ph.D., Doctor of Philosophy

    Efficient and Effective Methodologies for Exploring and Prediction Movement Patterns in Large Networks

    In the era of Big Data, the prevalence of networks of all kinds has grown dramatically, and analysing (mining) such networks to support decision-making processes has become an extremely important subject for research, typically with a view to some social and/or economic gain. This thesis describes research work within the theme of Movement Pattern Mining (MPM) as applied to large network data. MPM is a type of frequent pattern mining that provides insight into how information is exchanged between objects in large networks. In the context of the work described in this thesis, the focus is on how Movement Patterns (MPs) can be extracted from large networks efficiently and effectively, and how such movement patterns can best be utilised to predict future movement. The work describes how, by utilising big data facilities such as Shared/Distributed Memory Systems and Hadoop/MapReduce, novel data mining based techniques can be used not only to extract MPs from large networks, but also to utilise them for prediction purposes. To this end, the work in this thesis is divided into two parts. The first part is concerned with an investigation of an efficient mechanism for MPM. The second part is concerned with the utilisation of the extracted MPs in the context of prediction. For evaluation purposes, two large network datasets were used: the Great Britain Cattle Tracking System database and the Jiayuan Social Network. The evaluation indicates that an efficient and effective mechanism for identifying and extracting MPs from large networks, and subsequently using these MPs for prediction purposes, has been established.

    A new strategy for case-based reasoning retrieval using classification based on association

    Case-Based Reasoning (CBR) is an important area of research in the field of Artificial Intelligence. It aims to solve new problems by adapting solutions that were used to solve previous similar ones. Among the four typical phases (retrieval, reuse, revise and retain), retrieval is a key phase in the CBR approach, as the retrieval of wrong cases can lead to wrong decisions. To accomplish the retrieval process, a CBR system exploits Similarity-Based Retrieval (SBR). However, SBR tends to depend strongly on similarity knowledge, ignoring other forms of knowledge that can further improve retrieval performance. The aim of this study is to integrate class association rules (CARs), a special case of association rules (ARs), to discover a set of rules that can form an accurate classifier in a database. This is an efficient method for building a classifier where the target is pre-determined. The research question is whether CARs can be integrated into a CBR system. A new strategy is proposed that mines class association rules from previous cases, which could strengthen similarity-based retrieval (SBR). The research question can be answered by adapting the pattern of CARs so that it can be compared with the output of the Retrieval phase. Previous experiments and their results to date show a link between CARs and CBR cases. This link has been developed to achieve the aim and objectives. A novel strategy, Case-Based Reasoning using Association Rules (CBRAR), is proposed to improve the performance of SBR and to disambiguate wrongly retrieved cases in CBR. CBRAR uses CARs to generate an optimum frequent pattern tree (FP-tree) which holds a value for each node. The possible advantage offered is that more efficient results can be gained when SBR returns uncertain answers. In addition, CBRAR has been evaluated using two CBR frameworks, Jcolibri and Free CBR. The experimental evaluation on real datasets indicates that the proposed CBRAR is a better approach than the CBR systems it was compared with, offering higher accuracy and a lower error rate.
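    To make the notion of a class association rule concrete, the following minimal Python sketch enumerates rules of the form "set of attribute items implies class label" and keeps those that meet a support and a confidence threshold. It is a brute-force illustration, not the CBRAR strategy described above: the function name `class_association_rules`, the `max_len` limit and the toy cases are all assumptions.

```python
from collections import Counter
from itertools import combinations

def class_association_rules(cases, min_support, min_confidence, max_len=2):
    """Mine rules (body -> label) from cases given as (item set, label) pairs.

    support    = fraction of all cases containing the body with that label
    confidence = fraction of cases containing the body that have that label
    """
    n = len(cases)
    body_counts = Counter()   # how often each body occurs at all
    rule_counts = Counter()   # how often each body occurs with each label
    for items, label in cases:
        for size in range(1, max_len + 1):
            for body in combinations(sorted(items), size):
                body_counts[body] += 1
                rule_counts[(body, label)] += 1
    rules = []
    for (body, label), count in rule_counts.items():
        support = count / n
        confidence = count / body_counts[body]
        if support >= min_support and confidence >= min_confidence:
            rules.append((body, label, support, confidence))
    # Strongest rules first: by confidence, then by support.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return rules

# Hypothetical past cases: observed items and the class assigned to each case.
cases = [
    ({"fever", "cough"}, "flu"),
    ({"fever"}, "flu"),
    ({"cough"}, "cold"),
    ({"fever", "cough"}, "flu"),
]
rules = class_association_rules(cases, min_support=0.5, min_confidence=0.8)
```

    In a retrieval setting, rules like these could serve as an extra source of knowledge when similarity scores alone are ambiguous, which is the general motivation the abstract gives for combining CARs with SBR.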

    Fouille et classement d'ensembles fermés dans des données transactionnelles de grande échelle.

    The recent increase in data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of item sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands of items. These datasets usually follow a “long tail” distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (usually the most frequent ones). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, with fewer than a hundred associations at once. To address the item coverage challenge, our first contribution is the item-centric mining problem. It consists of computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI efficiently computes interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run-time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed up computation on large-scale datasets. Our second contribution is CAPA, a framework allowing us to study which existing measures of association rule quality are relevant for ranking results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemset mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain. Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets. Altogether, our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of item associations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.
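    The item-centric problem statement (for each item, the k most frequent closed itemsets containing it) can be illustrated with a small brute-force Python sketch. This is not TopPI: it enumerates every closed itemset as an intersection of transactions, which is only practical on toy data, and the function names `closed_itemsets` and `top_k_per_item` as well as the example tickets are assumptions.

```python
def closed_itemsets(transactions):
    """Return {closed itemset: support count} by brute force.

    Every closed itemset with non-zero support is the intersection of the
    transactions that contain it, so intersecting each new transaction with
    the closed sets found so far enumerates them all.
    """
    transactions = [frozenset(t) for t in transactions]
    closed = set()
    for t in transactions:
        closed |= {t} | {t & c for c in closed}
    closed.discard(frozenset())  # the empty itemset carries no association
    return {c: sum(1 for t in transactions if c <= t) for c in closed}

def top_k_per_item(transactions, k):
    """For each item, list its k most frequent closed itemsets."""
    support = closed_itemsets(transactions)
    items = set().union(*transactions)
    result = {}
    for item in items:
        containing = [(s, c) for s, c in support.items() if item in s]
        # Most frequent first; ties broken by size, then alphabetically.
        containing.sort(key=lambda sc: (-sc[1], len(sc[0]), sorted(sc[0])))
        result[item] = containing[:k]
    return result

# Hypothetical supermarket tickets.
tickets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]
per_item = top_k_per_item(tickets, k=2)
```

    Note that the rarest item in the toy data, eggs, still receives its own short list of associations, which is the coverage property that a single global frequency threshold would not guarantee.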