5 research outputs found

    Novel techniques to reduce search space in multiple minimum supports-based frequent pattern mining algorithms

    No full text
    ABSTRACT Frequent patterns are an important class of regularities that exist in a transaction database. Certain frequent patterns with low minimum support (minsup) value can provide useful information in many real-world applications. However, extraction of these frequent patterns with single minsupbased frequent pattern mining algorithms such as Apriori and FP-growth leads to "rare item problem." That is, at high minsup value, the frequent patterns with low minsup are missed, and at low minsup value, the number of frequent patterns explodes. In the literature, "multiple minsups framework" was proposed to discover frequent patterns. Furthermore, frequent pattern mining techniques such as Multiple Support Apriori and Conditional Frequent Pattern-growth (CFP-growth) algorithms have been proposed. As the frequent patterns mined with this framework do not satisfy downward closure property, the algorithms follow different types of pruning techniques to reduce the search space. In this paper, we propose an efficient CFP-growth algorithm by proposing new pruning techniques. Experimental results show that the proposed pruning techniques are effective

    Novel techniques to reduce search space in multiple minimum supports-based frequent pattern mining algorithms

    No full text
    Frequent patterns are an important class of regularities that exist in a transaction database. Certain frequent patterns with low minimum support (minsup) value can provide useful information in many real-world applications. However, extraction of these frequent patterns with single minsupbased frequent pattern mining algorithms such as Apriori and FP-growth leads to “rare item problem. ” That is, at high minsup value, the frequent patterns with low minsup are missed, and at low minsup value, the number of frequent patterns explodes. In the literature,“multiple minsups framework” was proposed to discover frequent patterns. Furthermore, frequent pattern mining techniques such as Multiple Support Apriori and Conditional Frequent Pattern-growth (CFP-growth) algorithms have been proposed. As the frequent patterns mined with this framework do not satisfy downward closure property, the algorithms follow different types of pruning techniques to reduce the search space. In this paper, we propose an efficient CFP-growth algorithm by proposing new pruning techniques. Experimental results show that the proposed pruning techniques are effective

    No Need to Know Physics: Resilience of Process-based Model-free Anomaly Detection for Industrial Control Systems

    Full text link
    In recent years, a number of process-based anomaly detection schemes for Industrial Control Systems were proposed. In this work, we provide the first systematic analysis of such schemes, and introduce a taxonomy of properties that are verified by those detection systems. We then present a novel general framework to generate adversarial spoofing signals that violate physical properties of the system, and use the framework to analyze four anomaly detectors published at top security conferences. We find that three of those detectors are susceptible to a number of adversarial manipulations (e.g., spoofing with precomputed patterns), which we call Synthetic Sensor Spoofing and one is resilient against our attacks. We investigate the root of its resilience and demonstrate that it comes from the properties that we introduced. Our attacks reduce the Recall (True Positive Rate) of the attacked schemes making them not able to correctly detect anomalies. Thus, the vulnerabilities we discovered in the anomaly detectors show that (despite an original good detection performance), those detectors are not able to reliably learn physical properties of the system. Even attacks that prior work was expected to be resilient against (based on verified properties) were found to be successful. We argue that our findings demonstrate the need for both more complete attacks in datasets, and more critical analysis of process-based anomaly detectors. We plan to release our implementation as open-source, together with an extension of two public datasets with a set of Synthetic Sensor Spoofing attacks as generated by our framework

    Predicate based association rules mining with new interestingness measure

    Get PDF
    Association Rule Mining (ARM) is one of the fundamental components in the field of data mining that discovers frequent itemsets and interesting relationships for predicting the associative and correlative behaviours for new data. However, traditional ARM techniques are based on support-confidence that discovers interesting association rules (ARs) using predefined minimum support (minsupp) and minimum confidence (minconf) threshold. In addition, traditional AR techniques only consider frequent items while ignoring rare ones. Thus, a new parameter-less predicated based ARM technique was proposed to address these limitations, which was enhanced to handle the frequent and rare items at the same time. Furthermore, a new interestingness measure, called g measure, was developed to select only highly interesting rules. In this proposed technique, interesting combinations were firstly selected by considering both the frequent and the rare items from a dataset. They were then mapped to the pseudo implications using predefined logical conditions. Later, inference rules were used to validate the pseudo-implications to discover rules within the set of mapped pseudo-implications. The resultant set of interesting rules was then referred to as the predicate based association rules. Zoo, breast cancer, and car evaluation datasets were used for conducting experiments. The results of the experiments were evaluated by its comparison with various classification techniques, traditional ARM technique and the coherent rule mining technique. The predicate-based rule mining approach gained an accuracy of 93.33%. In addition, the results of the g measure were compared with a state-of-the-art interestingness measure developed for a coherent rule mining technique called the h value. Predicate rules were discovered with an average confidence value of 0.754 for the zoo dataset and 0.949 for the breast cancer dataset, while the average confidence of the predicate rules found from the car evaluation dataset was 0.582. Results of this study showed that a set of interesting and highly reliable rules were discovered, including frequent, rare and negative association rules that have a higher confidence value. This research resulted in designing a methodology in rule mining which does not rely on the minsupp and minconf threshold. Also, a complete set of association rules are discovered by the proposed technique. Finally, the interestingness measure property for the selection of combinations from datasets makes it possible to reduce the exponential searching of the rules

    Frequent itemset mining on multiprocessor systems

    Get PDF
    Frequent itemset mining is an important building block in many data mining applications like market basket analysis, recommendation, web-mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow up to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, there have been many frequent-itemset mining algorithms proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures force the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradations. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also the other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining twofold: we (1) compress the datasets being mined because they must be kept in main memory during several mining invocations and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show a good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding is repeatedly required for loading and mining the datasets, we reduce its costs by providing parallel encodings that achieve high throughputs for both tasks. For a memory-efficient representation of the mining algorithms’ intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data’s size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. For coping with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Our algorithms are already single-threaded often up an order of magnitude faster than existing highly optimized algorithms and further scale almost linear on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms that are used for mining of other types of itemsets
    corecore