1,274 research outputs found

    HybridMiner: Mining Maximal Frequent Itemsets Using Hybrid Database Representation Approach

    In this paper we present a novel hybrid (array-based layout and vertical bitmap layout) database representation approach for mining the complete set of Maximal Frequent Itemsets (MFI) on sparse and large datasets. Our work is novel in terms of scalability, item search order, and two projection techniques (horizontal and vertical). We also present a maximal algorithm using this hybrid database representation approach. Experimental results on real and sparse benchmark datasets show that our approach is better than previous state-of-the-art maximal algorithms. Comment: 8 pages. In the proceedings of the 9th IEEE-INMIC 2005, Karachi, Pakistan, 200
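The vertical bitmap layout named in the abstract above can be illustrated with a small sketch. This is not the HybridMiner algorithm itself: the function names, the toy transactions, and the brute-force enumeration are all assumptions made for illustration. It only shows how support counting reduces to a bitwise AND over per-item bitmaps, and what "maximal" means for a frequent itemset.

```python
from itertools import combinations


def build_vertical_bitmaps(transactions):
    """Map each item to an int whose bit i is set when transaction i contains it."""
    bitmaps = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(itemset, bitmaps, n_transactions):
    """Count transactions containing every item by AND-ing the item bitmaps."""
    acc = (1 << n_transactions) - 1  # start with all transactions selected
    for item in itemset:
        acc &= bitmaps.get(item, 0)
    return bin(acc).count("1")


def maximal_frequent_itemsets(transactions, min_support):
    """Brute-force enumeration (hypothetical helper); only viable on tiny inputs."""
    bitmaps = build_vertical_bitmaps(transactions)
    n = len(transactions)
    items = sorted(bitmaps)
    frequent = [
        frozenset(candidate)
        for size in range(1, len(items) + 1)
        for candidate in combinations(items, size)
        if support(candidate, bitmaps, n) >= min_support
    ]
    # An itemset is maximal when no frequent proper superset exists.
    return [s for s in frequent if not any(s < other for other in frequent)]


# Toy data, invented for this example.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
mfi = maximal_frequent_itemsets(transactions, min_support=3)
```

With a minimum support of 3, every pair is frequent but the triple is not, so the three pairs are the maximal frequent itemsets. A practical miner would prune the search space rather than enumerate all candidates.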

    Using a combination of methodologies for improving medical information retrieval performance

    This thesis presents three approaches to improve the current state of Medical Information Retrieval. At the time of this writing, the health industry is undergoing a massive change as technology is introduced into all aspects of health delivery. The work in this thesis adapts established concepts from the field of Information Retrieval to the field of Medical Information Retrieval. In particular, we apply subtype filtering, ICD-9 codes, query expansion, and re-ranking methods in order to improve retrieval on medical texts. The first method applies association rule mining and cosine similarity measures. The second method applies subtype filtering and the Apriori algorithm. The third method uses ICD-9 codes in order to improve retrieval accuracy. Overall, we show that the current state of medical information retrieval has substantial room for improvement. Our first two methods do not show significant improvements, while our third approach shows an improvement of up to 20%.

    A Study Of Data Informatics: Data Analysis And Knowledge Discovery Via A Novel Data Mining Algorithm

    Frequent pattern mining (fpm) has become extremely popular among data mining researchers because it provides interesting and valuable patterns from large datasets. The decreasing cost of storage devices and the increasing availability of processing power make it possible for researchers to build and analyze gigantic datasets in various scientific and business domains. A filtering process is needed, however, to generate patterns that are relevant. This dissertation contributes to addressing this need. An experimental system named fpmies (frequent pattern mining information extraction system) was built to extract information from electronic documents automatically. Collocation analysis was used to analyze the relationship of words. Template mining was used to build the experimental system which is the foundation of fpmies. With the rising need for improved environmental performance, a dataset based on green supply chain practices of three companies was used to test fpmies. The new system was also tested by users, resulting in a recall of 83.4%. The new algorithm's combination of semantic relationships with template mining significantly improves the recall of fpmies. The study's results also show that fpmies is much more efficient than manually trying to extract information. Finally, the performance of the fpmies system was compared with the most popular fpm algorithm, Apriori, yielding a significantly improved recall and precision for fpmies (76.7% and 74.6% respectively) compared to that of Apriori (30% recall and 24.6% precision).

    A new strategy for case-based reasoning retrieval using classification based on association

    This paper proposes a novel strategy, Case-Based Reasoning Using Association Rules (CBRAR), to improve the performance of Similarity-Based Retrieval (SBR) using the classed frequent pattern tree algorithm FP-CAR, in order to disambiguate wrongly retrieved cases in Case-Based Reasoning (CBR). CBRAR uses class association rules (CARs) to generate an optimum FP-tree which holds a value for each node. The possible advantage offered is that more efficient results can be gained when SBR returns uncertain answers. We compare the CBR query, treated as a pattern, with FP-CAR patterns to identify the longest length of the voted class. If the patterns match, the proposed strategy can select not just the most similar case but the correct one. Our experimental evaluation on real data from the UCI repository indicates that the proposed CBRAR achieves better accuracy than the CBR systems used in our experiments.

    High Performance Frequent Subgraph Mining on Transactional Datasets

    Graph data mining has been a crucial as well as inevitable area of research. Large amounts of graph data are produced in many areas, such as bioinformatics, cheminformatics, social networks, and the Web. Scalable graph data mining methods are becoming increasingly popular and necessary due to increased graph complexity. Frequent subgraph mining (FSM) is one such area, where the task is to find frequently recurring patterns or subgraphs. To tackle this problem, many main-memory-based methods were proposed, which proved to be inefficient as data sizes grew exponentially over time. In the past few years several research groups have attempted to handle the FSM problem in multiple ways. Many authors have tried to achieve better performance using Graphics Processing Units (GPUs), which offer a multi-fold improvement over in-memory methods when dealing with large datasets. Later, Google's MapReduce model with the Hadoop framework proved to be a major breakthrough in high-performance large batch processing. Although MapReduce came with many benefits, its disk I/O and non-iterative model could not help much in the FSM domain, since subgraph mining is an iterative process. In recent years, Spark has emerged as the de facto industry standard with its distributed in-memory computing capability. It is also a good fit for iterative styles of programming. In this work, we cover how high-performance computing has helped to improve performance tremendously for transactional directed and undirected graphs, and we compare the performance of various FSM techniques based on experimental results.

    Mining Privacy-Preserving Association Rules based on Parallel Processing in Cloud Computing

    With the onset of the Information Era and the rapid growth of information technology, ample space for processing and extracting data has opened up. However, privacy concerns may stifle expansion throughout this area. This study addresses the challenge of reliable mining when transactions are dispersed across sources. It looks at the prospect of creating a new set of three algorithms that can obtain maximum privacy, data utility, and time savings. In the preparation phase, this paper proposes a unique double encryption and Transaction Splitter approach that alters the database to optimize the tradeoff between data utility and confidentiality. For the mining process, it presents a customized Apriori approach that does not examine the entire database to estimate the support for each attribute. Existing distributed data solutions have a high encryption complexity and an insufficient specification of many participants' properties. The proposed solutions provide increased privacy protection against a variety of attack models. Furthermore, they are much simpler and quicker in terms of communication cycles and processing complexity. Tests of the proposed work on a real-world transaction database demonstrate that the aim of the proposed method is realistic.

    Exploring Pattern Mining Algorithms for Hashtag Retrieval Problem

    The hashtag is an iconic feature for retrieving hot topics of discussion on Twitter and other social networks. This paper incorporates pattern mining approaches to improve the accuracy of retrieving relevant information and to speed up search. A novel algorithm called PM-HR (Pattern Mining for Hashtag Retrieval) is designed to first transform the set of tweets into a transactional database by considering two different strategies (trivial and temporal). After that, the set of relevant patterns is discovered and then used as a knowledge-based system for finding the relevant tweets based on users' queries under the similarity search process. Extensive experiments are carried out on large and varied tweet collections, and the proposed PM-HR outperforms the baseline hashtag retrieval approaches in terms of runtime, while being very competitive in terms of accuracy.