Search CORE

366 research outputs found

A Fast Minimal Infrequent Itemset Mining Algorithm

Author: Demchuk Kostyantyn
Leith Douglas J.
Publication venue
Publication date: 01/01/2014
Field of study

A novel fast algorithm for finding quasi identifiers in large datasets is presented. Performance measurements on a broad range of datasets demonstrate substantial reductions in run-time relative to the state of the art and the scalability of the algorithm to realistically-sized datasets up to several million records

arXiv.org e-Print Archive

MURAL - Maynooth University Research Archive Library

NUI Maynooth Eprint Archive

Maynooth University ePrints and eTheses Archive

Mining data quality rules based on T-dependence

Author: Boeckling Toon
Bronselaer Antoon
De Tré Guy
Publication venue: 'Atlantis Press'
Publication date: 01/01/2019
Field of study

Since their introduction in 1976, edit rules have been a standard tool in statistical analysis. Basically, edit rules are a compact representation of non-permitted combinations of values in a dataset. In this paper, we propose a technique to automatically find edit rules by use of the concept of T-dependence. We first generalize the traditional notion of lift, to that of T-lift, where stochastic independence is generalized to T-dependence. A combination of values is declared as an edit rule under a t-norm T if there is a strong negative correlation under T-dependence. We show several interesting properties of this approach. In particular, we show that under the minimum t-norm, edit rules can be computed efficiently by use of frequent pattern trees. Experimental results show that there is a weak to medium correlation in the rank order of edit rules obtained under T_M and T_P, indicating that the semantics of these kinds of dependencies are different

Crossref

Ghent University Academic Bibliography

New Approaches to Frequent and Incremental Frequent Pattern Mining

Author: Bicer Mehmet
Publication venue: CUNY Academic Works
Publication date: 01/06/2020
Field of study

Data Mining (DM) is a process for extracting interesting patterns from large volumes of data. It is one of the crucial steps in Knowledge Discovery in Databases (KDD). It involves various data mining methods that mainly fall into predictive and descriptive models. Descriptive models look for patterns, rules, relationships and associations within data. One of the descriptive methods is association rule analysis, which represents co-occurrence of items or events. Association rules are commonly used in market basket analysis. An association rule is in the form of X → Y and it shows that X and Y co-occur with a given level of support and conﬁdence. Association rule mining is a common technique used in discovering interesting frequent patterns in large datasets acquired in various application domains. Having petabytes of data ﬁnding its way into data storages in perhaps every day, made many researchers look for eﬃcient methods for analyzing these large datasets. Many algorithms have been proposed for searching for frequent patterns. The search space combinatorically explodes as the size of the source data increases. Simply using more powerful computers, or even super-computers to handle ever-increasing size of large data sets is not suﬃcient. Hence, incremental algorithms have been developed and used to improve the eﬃciency of frequent pattern mining. One of the challenges of frequent itemset mining is long running times of the algorithms. Two major costs of long running times of frequent itemset mining are due to the number of database scans and the number of candidates generated (the latter one requires memory, and the more the number of candidates there are the more memory space is needed. When the candidates do not ﬁt in memory then page swapping will occur which will increase the running time of the algorithms). In this dissertation we propose a new implementation of Apriori algorithm, NCLAT (Near Candidate-less Apriori with Tidlists), which scans the database only once and creates candidates only for level one (1-itemsets) which is equivalent to the total number of unique items in the database. In addition, we also show the results of choice of data structures used whether they are probabilistic or not, whether the datasets are horizontal or vertical, how counting is done, whether the algorithms are computed single or parallel way. We implement, explore and devise incremental algorithm UWEP with single as well as parallel computation. We have also cleaned a minor bug in UWEP and created a more eﬃcient version UWEP2, which reduces the number of candidates created and the number of database scans. We have run all of our tests against three datasets with diﬀerent features for diﬀerent minimum support levels. We show both frequent and incremental frequent itemset mining implementation test results and comparison to each other. While there has been a lot of work done on frequent itemset mining on structured data, very little work has been done on the unstructured data. So, we have created a new hybrid pattern search algorithm, Double-Hash, which performed better for all of our test scenarios than the known pattern search algorithms. Double-Hash can potentially be used in frequent itemset mining on unstructured data in the future. We will be presenting our work and test results on this as well

City University of New York

A log mining approach for process monitoring in SCADA

Author: Bolzoni Damiano
Hadziosmanovic Dina
Hartel Pieter
Publication venue: Springer
Publication date: 01/01/2012
Field of study

SCADA (Supervisory Control and Data Acquisition) systems are used for controlling and monitoring industrial processes. We propose a methodology to systematically identify potential process-related threats in SCADA. Process-related threats take place when an attacker gains user access rights and performs actions, which look legitimate, but which are intended to disrupt the SCADA process. To detect such threats, we propose a semi-automated approach of log processing. We conduct experiments on a real-life water treatment facility. A preliminary case study suggests that our approach is effective in detecting anomalous events that might alter the regular process workflow

Crossref

Springer - Publisher Connector

University of Twente Research Information

Extracting correlated parameters on multicore architectures

Author: Casali Alain
Ernst Christian
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 02/09/2013
Field of study

16 pagesInternational audienceIn this paper, we present a new approach relevant to the discovery of correlated patterns, based on the use of multicore architectures. Our work rests on a full KDD system and allows one to extract Decision Correlation Rules based on the Chi-squared criterion that include a target column from any database. To achieve this objective, we use a levelwise algorithm as well as contingency vectors, an alternate and more powerful representation of contingency tables, in order to prune the search space. The goal is to parallelize the processing associated with the extraction of relevant rules. The parallelization invokes the PPL (Parallel Patterns Library), which allows a simultaneous access to the whole available cores / processors on modern computers. We nally present rst results on the reached performance gains

HAL AMU

Hal-Diderot

HAL-EMSE

Colossal Trajectory Mining: A unifying approach to mine behavioral mobility patterns

Author: Francia M.
Gallinucci E.
Golfarelli M.
Publication venue
Publication date: 01/01/2024
Field of study

Spatio-temporal mobility patterns are at the core of strategic applications such as urban planning and monitoring. Depending on the strength of spatio-temporal constraints, different mobility patterns can be defined. While existing approaches work well in the extraction of groups of objects sharing fine-grained paths, the huge volume of large-scale data asks for coarse-grained solutions. In this paper, we introduce Colossal Trajectory Mining (CTM) to efficiently extract heterogeneous mobility patterns out of a multidimensional space that, along with space and time dimensions, can consider additional trajectory features (e.g., means of transport or activity) to characterize behavioral mobility patterns. The algorithm is natively designed in a distributed fashion, and the experimental evaluation shows its scalability with respect to the involved features and the cardinality of the trajectory dataset

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

The Fifth International VLDB Workshop on Management of Uncertain Data

Author
Publication venue: Centre for Telematics and Information Technology (CTIT)
Publication date: 29/08/2011
Field of study

University of Twente Research Information

A Fast Minimal Infrequent Itemset Mining Algorithm

Author: Demchuk Kostiantyn
Publication venue
Publication date: 01/05/2014
Field of study

MURAL - Maynooth University Research Archive Library

Frequent itemset mining on multiprocessor systems

Author: Schlegel Benjamin
Publication venue
Publication date: 30/05/2013
Field of study

Frequent itemset mining is an important building block in many data mining applications like market basket analysis, recommendation, web-mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow up to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, there have been many frequent-itemset mining algorithms proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures force the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradations. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also the other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining twofold: we (1) compress the datasets being mined because they must be kept in main memory during several mining invocations and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show a good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding is repeatedly required for loading and mining the datasets, we reduce its costs by providing parallel encodings that achieve high throughputs for both tasks. For a memory-efficient representation of the mining algorithms’ intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data’s size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. For coping with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Our algorithms are already single-threaded often up an order of magnitude faster than existing highly optimized algorithms and further scale almost linear on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms that are used for mining of other types of itemsets

Technische Universität Dresden: Qucosa