A Fast Minimal Infrequent Itemset Mining Algorithm
A novel fast algorithm for finding quasi-identifiers in large datasets is
presented. Performance measurements on a broad range of datasets demonstrate
substantial reductions in run-time relative to the state of the art, and the
scalability of the algorithm to realistically sized datasets of up to several
million records.
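The abstract does not spell out the algorithm itself, but a quasi-identifier can be understood as a minimal set of attributes whose value combinations (nearly) uniquely identify records. A brute-force sketch of that notion in Python, with hypothetical data and a hypothetical uniqueness threshold (not the paper's fast algorithm):

```python
from itertools import combinations

def distinct_ratio(rows, attrs):
    """Fraction of rows whose projection onto attrs is unique."""
    proj = [tuple(r[a] for a in attrs) for r in rows]
    counts = {}
    for p in proj:
        counts[p] = counts.get(p, 0) + 1
    unique = sum(1 for p in proj if counts[p] == 1)
    return unique / len(rows)

def minimal_quasi_identifiers(rows, attributes, threshold=0.8):
    """Brute-force search for minimal attribute sets whose value
    combinations uniquely identify at least `threshold` of the records."""
    found = []
    for size in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, size):
            # Skip supersets of already-found quasi-identifiers (minimality)
            if any(set(f) <= set(attrs) for f in found):
                continue
            if distinct_ratio(rows, attrs) >= threshold:
                found.append(attrs)
    return found

records = [
    {"zip": "2000", "age": 34, "sex": "F"},
    {"zip": "2000", "age": 34, "sex": "M"},
    {"zip": "2010", "age": 29, "sex": "F"},
    {"zip": "2010", "age": 41, "sex": "F"},
]
print(minimal_quasi_identifiers(records, ["zip", "age", "sex"]))
# -> [('age', 'sex')]
```

The exponential cost of enumerating attribute subsets is exactly what a fast algorithm, such as the one the abstract claims, must avoid on million-record datasets.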
A log mining approach for process monitoring in SCADA
SCADA (Supervisory Control and Data Acquisition) systems are used for controlling and monitoring industrial processes. We propose a methodology to systematically identify potential process-related threats in SCADA. Process-related threats take place when an attacker gains user access rights and performs actions that look legitimate but are intended to disrupt the SCADA process. To detect such threats, we propose a semi-automated approach to log processing. We conduct experiments on a real-life water treatment facility. A preliminary case study suggests that our approach is effective in detecting anomalous events that might alter the regular process workflow.
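The abstract leaves the log-processing approach at a high level. One minimal sketch of the underlying idea, flagging log events whose user/action combination was rare in historical logs, with hypothetical event records (the paper's actual methodology is richer than this):

```python
from collections import Counter

def build_profile(log):
    """Count how often each (user, action) pair appears in historical logs."""
    return Counter((e["user"], e["action"]) for e in log)

def flag_anomalies(profile, new_events, min_support=2):
    """Flag events whose (user, action) pair was rarely seen before.
    Such actions look legitimate individually but merit review."""
    return [e for e in new_events
            if profile[(e["user"], e["action"])] < min_support]

history = [
    {"user": "operator1", "action": "open_valve"},
    {"user": "operator1", "action": "open_valve"},
    {"user": "operator1", "action": "close_valve"},
    {"user": "operator1", "action": "close_valve"},
]
new = [{"user": "operator1", "action": "open_valve"},
       {"user": "operator1", "action": "change_setpoint"}]
print(flag_anomalies(build_profile(history), new))
# flags only the rarely seen "change_setpoint" action
```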
A Model-Based Frequency Constraint for Mining Associations from Transaction Data
Mining frequent itemsets is a popular method for finding associated items in
databases. For this method, support, the co-occurrence frequency of the items
which form an association, is used as the primary indicator of the
association's significance. A single user-specified support threshold is used
to decide if associations should be further investigated. Support has some
known problems with rare items, favors shorter itemsets and sometimes produces
misleading associations.
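The support measure the paragraph describes is simply the fraction of transactions containing every item of the candidate set; a minimal computation with toy transactions:

```python
def support(transactions, itemset):
    """Support of an itemset: the fraction of transactions
    containing every item in the set."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

db = [{"milk", "bread"}, {"milk", "eggs"},
      {"bread", "eggs"}, {"milk", "bread", "eggs"}]
print(support(db, {"milk", "bread"}))   # 2 of 4 transactions -> 0.5
print(support(db, {"milk"}))            # 3 of 4 transactions -> 0.75
```

A rare but strongly associated item pair can easily fall below a single global threshold here, which is the weakness the paper's model-based constraint targets.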
In this paper we develop a novel model-based frequency constraint as an
alternative to a single, user-specified minimum support. The constraint
utilizes knowledge of the process generating transaction data by applying a
simple stochastic mixture model (the NB model), which allows for transaction
data's typically highly skewed item frequency distribution. A user-specified
precision threshold is used together with the model to find local frequency
thresholds for groups of itemsets. Based on the constraint, we develop the
notion of NB-frequent itemsets and adapt a mining algorithm to find all
NB-frequent itemsets in a database. In experiments with publicly available
transaction databases we show that the new constraint provides improvements
over a single minimum support threshold and that the precision threshold is
more robust and easier for the user to set and interpret.
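One illustrative reading of the model-based idea (not the paper's exact procedure): fit a negative binomial (NB) model to the skewed item-count distribution, then derive a local frequency threshold as the smallest count whose upper-tail probability under the fitted model falls below a user-set level. A self-contained sketch with made-up counts and a hypothetical tail level:

```python
import math

def fit_nb(counts):
    """Method-of-moments fit of a negative binomial model to observed
    item counts.  Requires variance > mean (overdispersion, which is
    typical for transaction data)."""
    n = len(counts)
    m = sum(counts) / n
    v = sum((c - m) ** 2 for c in counts) / n
    p = m / v            # success probability
    r = m * m / (v - m)  # size parameter
    return r, p

def nb_pmf(k, r, p):
    """NB probability mass via log-gamma for numerical stability."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def local_threshold(r, p, tail=0.01, max_k=10_000):
    """Smallest count c whose upper-tail probability under the NB model
    falls below `tail` -- counts at or above c are unlikely to arise
    from the baseline generating process alone."""
    cdf = 0.0
    for k in range(max_k):
        cdf += nb_pmf(k, r, p)
        if 1.0 - cdf < tail:
            return k + 1
    return max_k

counts = [0, 0, 1, 0, 2, 1, 0, 5, 0, 1, 3, 0, 0, 8, 1, 0]
r, p = fit_nb(counts)
print(local_threshold(r, p, tail=0.01))
```

The appeal, as the abstract argues, is that the user sets one interpretable probability-style threshold while the model supplies different frequency cut-offs for differently distributed groups of items.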
Using association rule mining to enrich semantic concepts for video retrieval
In order to achieve true content-based information retrieval on video we should analyse and index video with
high-level semantic concepts in addition to using user-generated tags and structured metadata like title, date,
etc. However, the range of such high-level semantic concepts, detected either manually or automatically, is
usually limited compared to the richness of the information content in video and the potential vocabulary of
available concepts for indexing. Even though there is work to improve the performance of individual concept
classifiers, we should strive to make the best use of whatever partial sets of semantic concept occurrences
are available to us. We describe in this paper our method for using association rule mining to automatically
enrich the representation of video content through a set of semantic concepts based on concept co-occurrence
patterns. We describe our experiments on the TRECVid 2005 video corpus annotated with the 449 concepts
of the LSCOM ontology. The evaluation of our results shows the usefulness of our approach.
Contrast mining in large class imbalance data
University of Technology, Sydney. Faculty of Engineering and Information Technology.

Class imbalance data, in which the classes are not equally represented and the minority classes contain far fewer examples than the other classes, is pervasive and ubiquitous, particularly in applications such as fraud/intrusion detection, medical diagnosis/monitoring, and risk management. Conventional classifiers tend to be overwhelmed by the large classes while ignoring the smaller ones. Typically, many of the existing solutions to the class imbalance problem are proposed at the data level, and a few at the algorithmic level. However, our extensive experiments show that these prior methods all have limitations for anomaly detection. The thesis therefore targets contrast mining to solve the problem of anomaly detection in imbalanced data from three aspects: feature construction, an effective algorithm for mining contrast patterns, and the selection of optimal rule combinations through analysing rule interactions.
Feature construction is one of the most important steps in contrast pattern mining, as in any other data mining process. The majority of feature construction methods, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Fourier Transformation, and Independent Component Analysis, generate new features by transforming the existing raw features into a new data space. As a result, these solutions have several limitations with respect to training highly accurate classifiers on class imbalance data sets: incomprehensible features may be generated; because all samples are assumed to be independent, the feature set is unstable and sensitive to trivial changes in the sample set; significant domain knowledge is difficult to integrate; and classifiers built on the transformed feature set suffer from a high False Positive Rate on class imbalance data.
In order to train high-performance models in the imbalance scenario, we propose a novel method, Personalised Domain Driven Feature Mining (PDDFM), to generate important features by integrating domain knowledge effectively with full consideration of the correlations among samples. A framework specially designed for PDDFM is introduced. A novel feature selection method, called Mutual Reduction, is proposed to minimise the noise from redundant features and maximise the contribution of “trivial” features whose gain ratios are low but which contribute positively when cooperating with the others. The experimental evaluation reveals that our feature mining approach outperforms state-of-the-art methods in anomaly detection.
Contrast pattern mining has been studied intensively for its strong discriminative capability. However, state-of-the-art methods rarely consider the class imbalance problem, which has been proven to be a significant challenge in mining large-scale data. The thesis introduces a novel pattern, the converging pattern, which refers to itemsets whose supports contrast sharply between the minority class and the majority class. A novel algorithm, ConvergMiner, is also proposed to mine converging patterns efficiently. A lightweight index, the T*-tree, is built to speed up the search process and output patterns instantly, and a series of branch-and-bound pruning strategies is presented to greatly reduce the computational cost. Substantial experiments on large-scale real-life online banking transactions for fraud detection show that ConvergMiner greatly outperforms existing cost-sensitive classification methods in terms of accuracy. In particular, it efficiently and effectively detects frauds in large-scale imbalanced transaction sets; more importantly, its efficiency improves as the data imbalance increases. Since many converging patterns are generated, we then propose an effective novel method to select the optimal pattern set.
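The defining test for a converging pattern, support that contrasts sharply between the minority and majority classes, can be sketched as follows. The itemsets, transactions, and thresholds are hypothetical, and this is only the per-pattern check, not the ConvergMiner search with its T*-tree index and pruning:

```python
def class_support(transactions, itemset):
    """Fraction of a class's transactions containing the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def is_converging(itemset, minority, majority,
                  min_ratio=5.0, min_minority_sup=0.2):
    """True when the itemset's support in the minority (e.g. fraud)
    class sharply exceeds its support in the majority class."""
    sup_min = class_support(minority, itemset)
    sup_maj = class_support(majority, itemset)
    if sup_min < min_minority_sup:
        return False
    return sup_min / max(sup_maj, 1e-9) >= min_ratio

fraud = [{"new_payee", "high_amount"}, {"new_payee", "night"},
         {"new_payee", "high_amount", "night"}]
normal = [{"known_payee"}, {"known_payee", "high_amount"},
          {"known_payee"}, {"new_payee"}] * 25   # 100 majority transactions
print(is_converging({"new_payee", "high_amount"}, fraud, normal))  # True
print(is_converging({"high_amount"}, fraud, normal))               # False
```

The combination {new_payee, high_amount} is frequent among frauds but absent from normal traffic, while {high_amount} alone occurs in both classes and so does not converge.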
Rule-based anomaly and fraud detection systems often suffer from substantial false alerts in the context of a very large number of enterprise transactions with class imbalance characteristics. A crucial and challenging problem is to effectively select a globally optimal rule set which can capture very rare anomalies dispersed in large-scale background transactions. Existing rule selection methods, which suffer significantly from complex rule interactions and overlap in large imbalanced data, often lead to very high false positive rates. We analyse the interactions and relationships between rules and their coverage in transactions, and propose a novel metric, Max Coverage Gain (MCG). MCG selects the optimal rule set by evaluating the contribution of each rule to overall performance, cutting out rules that are locally significant but globally redundant, without any negative impact on recall. An effective algorithm, MCGminer, is then designed with a series of built-in mechanisms and pruning strategies to handle complex rule interactions and reduce the computational complexity of identifying the globally optimal rule set. Substantial experiments on 13 UCI data sets and a real-time online banking transaction database demonstrate that MCGminer achieves significant improvements in accuracy, scalability, stability and efficiency on large imbalanced data compared to several state-of-the-art rule selection techniques.
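The coverage-gain idea behind the MCG metric can be illustrated with a greedy sketch: repeatedly pick the rule covering the most not-yet-covered anomalies, and drop rules whose marginal gain is zero as globally redundant. The rule-to-anomaly mapping is hypothetical, and this illustrates only the coverage-gain principle, not the MCGminer algorithm itself:

```python
def select_rules(rule_coverage):
    """Greedy selection by marginal coverage gain.  Rules with zero
    gain are globally redundant and are dropped, so the selected set
    keeps full recall with fewer, less overlapping rules."""
    covered = set()
    selected = []
    remaining = dict(rule_coverage)
    while remaining:
        best = max(remaining, key=lambda r: len(remaining[r] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break  # everything left adds no new coverage
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered

# Hypothetical rules mapped to the ids of the anomalies they catch
rules = {"r1": {1, 2, 3}, "r2": {2, 3}, "r3": {4, 5}, "r4": {3, 4}}
selected, covered = select_rules(rules)
print(selected, covered)   # ['r1', 'r3'] {1, 2, 3, 4, 5}
```

Here r2 and r4 are each locally significant (they catch real anomalies) yet globally redundant: r1 and r3 alone cover all five anomalies, so fewer rules fire and fewer overlapping false alerts accumulate.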
Following that, the proposed contrast analysis techniques have been applied in two industrial projects. The first project was “Fraud Detection in Online Banking” for a major bank in Australia. We developed a risk management platform called i-Alertor, which is mainly powered by the techniques introduced in this thesis. According to the evaluation report, i-Alertor outperforms the existing rule-based system by 10%. The second project was “Key Indicator Discovery in Student Learning” for a leading university in Australia. Another platform, called i-Educator, was developed to support this application.