
    Frequent Itemset Mining for Big Data

    Traditional data mining tools, developed to extract actionable knowledge from data, have proven inadequate for processing the huge amounts of data produced nowadays. Even the most popular algorithms for Frequent Itemset Mining, an exploratory data analysis technique used to discover frequent co-occurrences of items in a transactional dataset, are inefficient on larger and more complex data. As a consequence, many parallel algorithms have been developed, based on modern frameworks able to leverage distributed computation on commodity clusters of machines (e.g., Apache Hadoop, Apache Spark). However, parallelizing frequent itemset mining is far from trivial: the search-space exploration, on which all the techniques are based, is not easily partitionable. Hence, distributed frequent itemset mining is a challenging problem and an interesting research topic. In this context, our main contributions consist in (i) an exhaustive theoretical and experimental analysis of the best-in-class approaches, whose outcomes and open issues motivated (ii) the development of a distributed high-dimensional frequent itemset miner. The dissertation also introduces (iii) a data mining framework that relies heavily on distributed frequent itemset mining for the extraction of a specific type of itemsets. The theoretical analysis highlights the challenges related to the distribution and the preliminary partitioning of the frequent itemset mining problem (i.e., the search-space exploration), describing the most widely adopted distribution strategies. The extensive experimental campaign, instead, compares the expectations raised by the algorithmic choices against the actual performance of the algorithms. We ran more than 300 experiments to evaluate and discuss the performance of the algorithms with respect to different real-life use cases and data distributions. The outcome of the review is that no algorithm is universally superior and that performance is heavily affected by the data distribution. Moreover, we identified a concrete gap concerning frequent pattern extraction in high-dimensional use cases. For this reason, we developed our own distributed high-dimensional frequent itemset miner based on Apache Hadoop. The algorithm splits the search-space exploration into independent sub-tasks. However, since the exploration benefits strongly from full knowledge of the problem, we introduced an interleaved synchronization phase. The result is a trade-off between the benefits of a centralized state and those of the additional computational power provided by parallelism. The experimental benchmarks, performed on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing, and robustness to memory issues. Finally, the dissertation introduces a data mining framework in which distributed itemset mining is a fundamental component of the processing pipeline. The aim of the framework is the extraction of a new type of itemsets, called misleading generalized itemsets.
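
    As an illustration of the search-space partitioning idea discussed above (not the dissertation's actual algorithm), the following plain-Python sketch assigns each frequent item its own independent sub-task: every sub-task mines only the transactions projected onto its prefix item, so the sub-tasks could in principle be shipped to different workers of a Hadoop or Spark cluster. All function and variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

def mine_partition(prefix_item, projected_db, min_support):
    """Independent sub-task: mine every frequent itemset that contains
    prefix_item, using only the transactions projected onto that item."""
    counts = Counter()
    for transaction in projected_db:
        for size in range(1, len(transaction) + 1):
            for combo in combinations(transaction, size):
                counts[frozenset(combo) | {prefix_item}] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

def partitioned_fim(transactions, min_support):
    """Toy driver: one projected database (and sub-task) per frequent item.
    On Hadoop/Spark each mine_partition call would run on a different worker."""
    item_counts = Counter(item for t in transactions for item in t)
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    results = {frozenset([i]): c for i, c in item_counts.items() if c >= min_support}
    for item in sorted(frequent_items):
        # keep only frequent items that sort after the prefix, to avoid duplicate itemsets
        projected = [sorted(x for x in t if x in frequent_items and x > item)
                     for t in transactions if item in t]
        results.update(mine_partition(item, [p for p in projected if p], min_support))
    return results

if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, support in sorted(partitioned_fim(db, 3).items(), key=lambda kv: -kv[1]):
        print(sorted(itemset), support)
```

    A real distributed miner would also have to address the synchronization, load-balancing, and memory concerns discussed in the abstract; they are omitted here.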

    I-prune: Item selection for associative classification

    Associative classification is characterized by accurate models but high model generation time. Most of that time is spent extracting and post-processing a large set of irrelevant rules, which are eventually pruned. We propose I-prune, an item-pruning approach that selects uninteresting items by means of an interestingness measure and prunes them as soon as they are detected. Thus, the number of extracted rules is reduced and model generation time decreases correspondingly. A wide set of experiments on real and synthetic data sets has been performed to evaluate I-prune and to select the appropriate interestingness measure. The experimental results show that I-prune yields a significant reduction in model generation time while increasing (or at worst preserving) model accuracy. The experimental evaluation also points to the chi-square measure as the most effective interestingness measure for item pruning.
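
    A minimal sketch of chi-square-based item pruning of the kind described above, assuming a toy dataset of (itemset, class label) records and the usual 95% critical value for one degree of freedom (3.84); this illustrates the general idea, not the authors' implementation.

```python
def chi_square(item, records, positive_class):
    """2x2 chi-square between item occurrence and the class label."""
    n = len(records)
    a = sum(1 for items, label in records if item in items and label == positive_class)
    b = sum(1 for items, label in records if item in items and label != positive_class)
    c = sum(1 for items, label in records if item not in items and label == positive_class)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def prune_items(records, positive_class, critical_value=3.84):
    """Keep only items whose association with the class exceeds the critical value."""
    items = {i for itemset, _ in records for i in itemset}
    return {i for i in items if chi_square(i, records, positive_class) >= critical_value}

# toy training set: (items, class label); uninteresting items are dropped before rule extraction
records = [({"bread", "butter"}, "buy"), ({"bread"}, "buy"),
           ({"beer"}, "skip"), ({"beer", "chips"}, "skip"),
           ({"bread", "chips"}, "buy"), ({"beer"}, "skip")]
print(prune_items(records, "buy"))   # keeps the items strongly associated with the class
```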

    Web-based Text Mining

    Text mining deals with the retrieval of specific information through custom search engines. Given the massive amount of information available on the World Wide Web, text mining returns results in order of highest relevance to the keywords in the query. Text expresses a vast, rich range of information, but encodes it in a form that is difficult to decipher automatically; for example, it is much more difficult to display textual content graphically than quantitative data. In this paper we describe a method for choosing a subset of the Web and an approach to building a flexible search service that adopts a new way of generating highly effective results for expert searches. Information retrieval also poses the problem of redundancy, i.e., retrieving the same data repeatedly. This paper presents an optimized solution for fast retrieval of data and also proposes methods for regenerating queries from those previously posed.
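
    The abstract does not detail the retrieval method; as a generic illustration of relevance ranking and redundancy filtering, the sketch below ranks documents by TF-IDF cosine similarity to a query and drops near-duplicate results. It relies on scikit-learn and is not the approach proposed in the paper; the data and the 0.95 duplicate threshold are placeholders.

```python
# generic TF-IDF relevance ranking + near-duplicate filtering (not the paper's method)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "frequent itemset mining on big data with spark",
    "text mining of web pages for expert search",
    "text mining of web pages for expert search",   # same page fetched twice
    "elephant flow detection in software defined networks",
]
query = "web text mining"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# rank documents by cosine similarity to the query
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranked = sorted(zip(scores, range(len(documents))), reverse=True)

# drop near-duplicate results so the same content is not returned twice
kept = []
for score, idx in ranked:
    if all(cosine_similarity(doc_vectors[idx], doc_vectors[j])[0, 0] < 0.95 for j in kept):
        kept.append(idx)
        print(round(float(score), 3), documents[idx])
```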

    Frequent Itemsets Mining for Big Data: A Comparative Analysis

    Itemset mining is a well-known exploratory data mining technique used to discover interesting correlations hidden in a data collection. Since it supports different targeted analyses, it is profitably exploited in a wide range of domains, ranging from network traffic data to medical records. With the increasing amount of generated data, different scalable algorithms have been developed, exploiting the advantages of distributed computing frameworks such as Apache Hadoop and Spark. This paper reviews Hadoop- and Spark-based scalable algorithms addressing the frequent itemset mining problem in the Big Data domain through both theoretical and experimental comparative analyses. Since the itemset mining task is computationally expensive, its distribution and parallelization strategies heavily affect memory usage, load balancing, and communication costs. A detailed discussion of the algorithmic choices of the distributed methods for frequent itemset mining is followed by an experimental analysis comparing the performance of state-of-the-art distributed implementations on both synthetic and real datasets. The strengths and weaknesses of the algorithms are thoroughly discussed with respect to the dataset features (e.g., data distribution, average transaction length, number of records) and specific parameter settings. Finally, based on the theoretical and experimental analyses, open research directions for the parallelization of the itemset mining problem are presented.
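
    For reference, one of the Spark-based implementations typically included in such comparisons is MLlib's parallel FP-Growth; the minimal PySpark sketch below shows how it is invoked, with placeholder data and parameter values (minSupport, minConfidence) rather than the paper's actual experimental settings.

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fim-comparison-sketch").getOrCreate()

# toy transactional dataset; the paper's benchmarks use real and synthetic datasets
df = spark.createDataFrame([
    (0, ["a", "b", "c"]),
    (1, ["a", "b"]),
    (2, ["a", "c"]),
    (3, ["b", "c"]),
    (4, ["a", "b", "c"]),
], ["id", "items"])

# minSupport is the main knob whose effect depends heavily on the data distribution
fp = FPGrowth(itemsCol="items", minSupport=0.6, minConfidence=0.8)
model = fp.fit(df)

model.freqItemsets.show()       # frequent itemsets with their support counts
model.associationRules.show()   # rules derived from the frequent itemsets

spark.stop()
```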

    Flow-Aware Elephant Flow Detection for Software-Defined Networks

    Software-defined networking (SDN) separates the network control plane from the packet forwarding plane, which provides comprehensive network-state visibility for better network management and resilience. Traffic classification, particularly elephant flow detection, can lead to improved flow control and resource provisioning in SDN networks. Existing elephant flow detection techniques use pre-set thresholds that cannot scale with changes in the traffic concept and distribution. This paper proposes a flow-aware elephant flow detection technique for SDN. The proposed technique employs two classifiers, deployed respectively on the SDN switches and the controller, to achieve accurate elephant flow detection efficiently. Moreover, this technique allows sharing the elephant flow classification tasks between the controller and the switches. Hence, most mice flows can be filtered at the switches, avoiding the need to send large numbers of classification requests and signaling messages to the controller. Experimental findings reveal that the proposed technique outperforms contemporary methods in terms of running time, accuracy, F-measure, and recall.
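
    Below is a hedged sketch of the two-stage classification scheme described above, using scikit-learn in place of real switch and controller logic: a shallow decision tree stands in for the lightweight switch-side classifier that filters mice flows, and a random forest stands in for the controller-side classifier. The flow features, synthetic training data, and model choices are assumptions for illustration only, not the paper's design.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# hypothetical per-flow features: [packet count, byte count, duration in seconds]
mice = np.column_stack([rng.integers(1, 20, 500),
                        rng.integers(40, 2_000, 500),
                        rng.random(500)])
elephants = np.column_stack([rng.integers(500, 5_000, 50),
                             rng.integers(1_000_000, 100_000_000, 50),
                             10 + 60 * rng.random(50)])
X = np.vstack([mice, elephants])
y = np.array([0] * 500 + [1] * 50)   # 0 = mouse flow, 1 = elephant flow

# stage 1: cheap model standing in for the switch-side classifier (filters obvious mice)
switch_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# stage 2: heavier model standing in for the controller-side classifier
controller_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def classify_flow(features):
    """Send a flow to the controller only if the switch stage flags it as a candidate."""
    if switch_clf.predict([features])[0] == 0:
        return "mouse (handled at the switch)"
    return "elephant" if controller_clf.predict([features])[0] == 1 else "mouse (rejected by controller)"

print(classify_flow([5, 600, 0.2]))             # expected: handled at the switch
print(classify_flow([2_000, 5_000_000, 30.0]))  # expected: elephant
```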

    Twitter data analysis by means of Strong Flipping Generalized Itemsets

    Twitter data has recently been used to perform a large variety of advanced analyses. The analysis of Twitter data poses new challenges because the data distribution is intrinsically sparse, due to the large number of messages posted every day using a wide vocabulary. To address this issue, generalized itemsets - sets of items at different abstraction levels - can be effectively mined and used to discover interesting multiple-level correlations among data supplied with taxonomies. Each generalized itemset is characterized by a correlation type (positive, negative, or null) according to the strength of the correlation among its items. This paper presents a novel data mining approach to supporting different and interesting targeted analyses - topic trend analysis, context-aware service profiling - by analyzing Twitter posts. We aim at discovering contrasting situations by means of generalized itemsets. Specifically, we focus on comparing itemsets discovered at different abstraction levels and we select large subsets of specific (descendant) itemsets that show correlation type changes with respect to their common ancestor. To this aim, a novel kind of pattern, namely the Strong Flipping Generalized Itemset (SFGI), is extracted from Twitter messages and contextual information supplied with taxonomy hierarchies. Each SFGI consists of a frequent generalized itemset X and the set of its descendants showing a correlation type change with respect to X. Experiments performed on both real and synthetic datasets demonstrate the effectiveness of the proposed approach in discovering interesting and hidden knowledge from Twitter data.
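
    To make the notion of a correlation type "flip" concrete, the sketch below classifies a generalized itemset and its taxonomy descendants as positively, negatively, or null correlated using lift with a fixed tolerance band (an illustrative choice; the paper's actual correlation measure and thresholds may differ), and shows a case where the ancestor is null while its descendants flip to positive and negative.

```python
def support(itemset, transactions, taxonomy):
    """Support of a (possibly generalized) itemset: a taxonomy key matches any of its children."""
    def matches(item, t):
        return item in t or any(child in t for child in taxonomy.get(item, ()))
    return sum(all(matches(i, t) for i in itemset) for t in transactions) / len(transactions)

def correlation_type(itemset, transactions, taxonomy, eps=0.1):
    """Positive / negative / null correlation decided via lift (illustrative measure)."""
    joint = support(itemset, transactions, taxonomy)
    expected = 1.0
    for item in itemset:
        expected *= support({item}, transactions, taxonomy)
    if joint == 0 or expected == 0:
        return "null"
    lift = joint / expected
    return "positive" if lift > 1 + eps else "negative" if lift < 1 - eps else "null"

# toy taxonomy and Twitter-like transactions (city, topic)
taxonomy = {"Italy": ["Turin", "Milan"]}
transactions = ([{"Turin", "traffic"}] * 4 + [{"Milan", "traffic"}] +
                [{"Turin", "weather"}] + [{"Milan", "weather"}] * 4)

ancestor = {"Italy", "traffic"}
descendants = [{"Turin", "traffic"}, {"Milan", "traffic"}]

print(sorted(ancestor), correlation_type(ancestor, transactions, taxonomy))   # null
for d in descendants:
    print(sorted(d), correlation_type(d, transactions, taxonomy))             # positive / negative
# an ancestor whose descendants change correlation type would be reported as an SFGI
```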