
    Contrast mining in large class imbalance data

    University of Technology, Sydney. Faculty of Engineering and Information Technology. Class imbalance data, in which the classes are not equally represented and the minority classes contain far fewer examples than the other classes, is pervasive, particularly in applications such as fraud/intrusion detection, medical diagnosis/monitoring, and risk management. Conventional classifiers tend to be overwhelmed by the large classes while ignoring the small ones. Most existing solutions to the class imbalance problem are proposed at the data level, and a few at the algorithmic level; however, our extensive experiments show that these methods have limitations for anomaly detection. The thesis therefore applies contrast mining to anomaly detection in imbalanced data from three aspects: feature construction, an effective algorithm for mining contrast patterns, and selection of optimal rule combinations through analysis of rule interactions. Feature construction is one of the most important steps in contrast pattern mining, as in any other data mining process. Most feature construction methods, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Fourier Transformation, and Independent Component Analysis, generate new features by transforming the existing raw features into a new data space. These solutions have several limitations when the goal is to train highly accurate classifiers on class imbalance data sets: incomprehensible features may be generated; the methods assume that all samples are independent; the feature set is unstable and sensitive to trivial changes in the sample set; it is difficult to integrate significant domain knowledge; and classifiers built on the transformed feature set suffer from a high False Positive Rate on class imbalance data. To train high-performance models in the imbalance scenario, we propose a novel method, Personalised Domain Driven Feature Mining (PDDFM), which generates important features by effectively integrating domain knowledge and fully considering the correlations among samples. A framework specially designed for PDDFM is introduced. A novel feature selection method, Mutual Reduction, is proposed to minimise the noise from redundant features and maximise the contribution of “trivial” features whose gain ratios are low but which contribute positively when cooperating with other features. The experimental evaluation shows that our feature mining approach outperforms state-of-the-art methods in anomaly detection. Contrast pattern mining has been studied intensively for its strong discriminative capability, but state-of-the-art methods rarely consider the class imbalance problem, which has been proven to be a significant challenge in mining large-scale data. The thesis introduces a novel pattern, the converging pattern, which refers to itemsets whose supports contrast sharply between the minority class and the majority class. A novel algorithm, ConvergMiner, is proposed to mine converging patterns efficiently. A lightweight index, the T*-tree, is built to speed up the search and output patterns instantly, and a series of branch-and-bound pruning strategies further reduce the computational cost.
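    The core converging-pattern test, an itemset that is frequent in the minority class but (almost) absent from the majority class, can be sketched as follows. This is only an illustrative reading of the abstract, not ConvergMiner itself: the brute-force enumeration, the thresholds min_sup and min_ratio, and the helper support are assumptions for demonstration; the actual algorithm relies on the T*-tree index and branch-and-bound pruning instead.

        from itertools import combinations

        def support(itemset, transactions):
            """Fraction of transactions that contain every item of the itemset."""
            if not transactions:
                return 0.0
            hits = sum(1 for t in transactions if itemset <= t)
            return hits / len(transactions)

        def converging_patterns(minority, majority, min_sup=0.05, min_ratio=10.0, max_len=3):
            """Naive enumeration of 'converging' itemsets: frequent in the minority
            class but rare in the majority class (illustrative sketch only)."""
            items = set().union(*minority) if minority else set()
            patterns = []
            for k in range(1, max_len + 1):
                for combo in combinations(sorted(items), k):
                    itemset = frozenset(combo)
                    sup_min = support(itemset, minority)
                    if sup_min < min_sup:          # not frequent enough in the minority class
                        continue
                    sup_maj = support(itemset, majority)
                    ratio = sup_min / sup_maj if sup_maj > 0 else float("inf")
                    if ratio >= min_ratio:         # support contrasts sharply between classes
                        patterns.append((itemset, sup_min, sup_maj, ratio))
            return patterns

        # toy example: each transaction is a set of categorical items
        minority = [{"night", "new_payee", "high_amt"}, {"night", "new_payee"}, {"night", "high_amt"}]
        majority = [{"day", "known_payee"}, {"day", "high_amt"}, {"day", "known_payee", "low_amt"}] * 50
        print(converging_patterns(minority, majority))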
    Substantial experiments on large-scale real-life online banking transactions for fraud detection show that ConvergMiner greatly outperforms existing cost-sensitive classification methods in terms of accuracy. In particular, it efficiently and effectively detects fraud in large-scale imbalanced transaction sets, and its efficiency improves as the data imbalance increases. Once many converging patterns have been generated, we propose an effective novel method to select the optimal pattern set. Rule-based anomaly and fraud detection systems often suffer from substantial false alerts when applied to very large numbers of enterprise transactions with class imbalance characteristics. A crucial and challenging problem is to select a globally optimal rule set that can capture very rare anomalies dispersed in large-scale background transactions. Existing rule selection methods, which suffer significantly from complex rule interactions and overlap in large imbalanced data, often lead to very high false positive rates. We analyse the interactions and relationships between rules and their coverage of transactions, and propose a novel metric, Max Coverage Gain (MCG). MCG selects the optimal rule set by evaluating the contribution of each rule to overall performance, cutting out rules that are locally significant but globally redundant, without any negative impact on recall. An effective algorithm, MCGminer, is then designed with a series of built-in mechanisms and pruning strategies to handle complex rule interactions and reduce the computational complexity of identifying the globally optimal rule set. Substantial experiments on 13 UCI data sets and a real-time online banking transaction database demonstrate that MCGminer achieves significant improvements in accuracy, scalability, stability and efficiency on large imbalanced data compared to several state-of-the-art rule selection techniques. The proposed contrast analysis techniques have been applied in two industrial projects. The first, “Fraud Detection in Online Banking” for a major bank in Australia, led to a risk management platform called i-Alertor, which is mainly powered by the techniques introduced in this thesis; according to the evaluation report, i-Alertor outperforms the existing rule-based system by 10%. The second, “Key Indicator Discovery in Student Learning” for a key university in Australia, is supported by another platform, i-Educator.
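    The idea behind MCG, ranking rules by their marginal contribution to overall coverage and discarding globally redundant ones, can be illustrated with a greedy sketch. This is not MCGminer: the function name, the rule_covers mapping, and the stopping criterion are assumptions, and the real algorithm adds further mechanisms and pruning strategies on top of the basic coverage-gain intuition.

        def greedy_rule_selection(rule_covers, anomaly_ids, max_rules=None):
            """Greedy sketch of coverage-gain rule selection: at each step pick the
            rule whose alerts cover the most not-yet-covered anomalies, and stop when
            no remaining rule adds coverage.  rule_covers maps rule name -> set of
            anomalous transaction ids the rule fires on."""
            selected, covered = [], set()
            remaining = dict(rule_covers)
            while remaining and (max_rules is None or len(selected) < max_rules):
                # marginal coverage gain of each remaining rule
                gains = {r: len(ids - covered) for r, ids in remaining.items()}
                best = max(gains, key=gains.get)
                if gains[best] == 0:               # every further rule is globally redundant
                    break
                selected.append(best)
                covered |= remaining.pop(best)
            recall = len(covered & anomaly_ids) / len(anomaly_ids) if anomaly_ids else 1.0
            return selected, recall

        # toy example: three rules with heavily overlapping coverage
        rule_covers = {"r1": {1, 2, 3}, "r2": {2, 3}, "r3": {4}}
        print(greedy_rule_selection(rule_covers, anomaly_ids={1, 2, 3, 4}))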

    Automated design of robust discriminant analysis classifier for foot pressure lesions using kinematic data

    In recent years, the use of motion tracking systems for the acquisition of functional biomechanical gait data has received increasing interest due to the richness and accuracy of the measured kinematic information. However, costs frequently restrict the number of subjects employed, which makes the dimensionality of the collected data far higher than the number of available samples. This paper applies discriminant analysis algorithms to the classification of patients with different types of foot lesions, in order to establish an association between foot motion and lesion formation. With primary attention to small-sample-size situations, we compare different types of Bayesian classifiers and evaluate their performance with various dimensionality reduction techniques for feature extraction, as well as search methods for the selection of raw kinematic variables. Finally, we propose a novel integrated method which fine-tunes the classifier parameters and selects the most relevant kinematic variables simultaneously. Performance comparisons are made using robust resampling techniques such as the Bootstrap .632+ estimator and k-fold cross-validation. Results from experiments with lesion subjects suffering from pathological plantar hyperkeratosis show that the proposed method can achieve approximately 96% correct classification rates with less than 10% of the original features.
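    The general pattern described here, selecting a small subset of kinematic variables and feeding them to a discriminant-analysis classifier under resampled evaluation, can be sketched with scikit-learn. This is a minimal illustration rather than the authors' pipeline: the use of scikit-learn, univariate feature selection, the placeholder data, and the fold count are all assumptions.

        # Minimal sketch (not the authors' exact method): univariate selection of
        # kinematic variables followed by an LDA classifier, scored with stratified
        # k-fold cross-validation.
        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.model_selection import StratifiedKFold, cross_val_score
        from sklearn.pipeline import Pipeline

        rng = np.random.default_rng(0)
        X = rng.normal(size=(40, 200))        # placeholder: few subjects, many kinematic variables
        y = rng.integers(0, 2, size=40)       # placeholder: two lesion classes

        pipe = Pipeline([
            ("select", SelectKBest(f_classif, k=20)),   # keep <10% of the raw variables
            ("lda", LinearDiscriminantAnalysis()),
        ])
        scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
        print(f"mean CV accuracy: {scores.mean():.3f}")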

    k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

    Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour classifier: classification is achieved by identifying the nearest neighbours of a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days given the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time series, retrieval speed-up and intrinsic dimensionality have been added. An appendix provides access to Python code for the key methods. Comment: 22 pages, 15 figures; an updated edition of an older tutorial on kNN.
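    The neighbour lookup and majority vote described in the first sentence can be written in a few lines. The sketch below is illustrative and is not the Python code from the paper's appendix; it assumes Euclidean distance and a plain majority vote.

        # Minimal k-NN sketch: find the k training examples closest to the query
        # and take a majority vote over their class labels.
        from collections import Counter
        import numpy as np

        def knn_predict(X_train, y_train, query, k=3):
            """Classify `query` by majority vote over its k nearest training examples
            under Euclidean distance."""
            dists = np.linalg.norm(X_train - query, axis=1)   # distance to every training example
            nearest = np.argsort(dists)[:k]                   # indices of the k closest
            votes = Counter(y_train[i] for i in nearest)
            return votes.most_common(1)[0][0]

        X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
        y_train = np.array(["a", "a", "b", "b"])
        print(knn_predict(X_train, y_train, np.array([4.8, 5.0]), k=3))   # -> "b"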

    Autoencoding the Retrieval Relevance of Medical Images

    Content-based image retrieval (CBIR) of medical images is a crucial task that can contribute to more reliable diagnosis if applied to big data. Recent advances in feature extraction and classification have enormously improved CBIR results for digital images. However, considering the increasing accessibility of big data in medical imaging, we still need to reduce both the memory requirements and the computational expense of image retrieval systems. This work proposes to exclude the features of image blocks that exhibit a low encoding error when learned by an n/p/n autoencoder (p < n). We examine the histogram of autoencoding errors of image blocks for each image class to facilitate the decision of which image regions, or roughly what percentage of an image, should be declared relevant for the retrieval task. This reduces feature dimensionality and speeds up the retrieval process. To validate the proposed scheme, we employ local binary patterns (LBP) and support vector machines (SVM), both well-established approaches in the CBIR research community, and use the IRMA dataset with 14,410 x-ray images as test data. The results show that the dimensionality of annotated feature vectors can be reduced by up to 50%, resulting in speedups greater than 27% at the expense of less than a 1% decrease in retrieval accuracy when validating the precision and recall of the top 20 hits. Comment: To appear in proceedings of the 5th International Conference on Image Processing Theory, Tools and Applications (IPTA'15), Nov 10-13, 2015, Orleans, France.
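    The block-relevance idea, train a narrow n/p/n autoencoder on image blocks and discard the blocks it reconstructs too easily, can be sketched roughly as below. This is an illustrative reading of the abstract, not the authors' implementation: scikit-learn's MLPRegressor standing in for the autoencoder, the 16x16 block size, the bottleneck width p, and the 50th-percentile cut-off are all assumptions.

        # Rough sketch: an n/p/n autoencoder (p < n) is trained on flattened image
        # blocks, and blocks with a LOW reconstruction error are treated as
        # uninformative and excluded from feature extraction.
        import numpy as np
        from sklearn.neural_network import MLPRegressor

        def blocks_from_image(img, size=16):
            """Split a 2-D image into flattened, non-overlapping size x size blocks."""
            h, w = img.shape
            out = []
            for r in range(0, h - size + 1, size):
                for c in range(0, w - size + 1, size):
                    out.append(img[r:r + size, c:c + size].ravel())
            return np.asarray(out, dtype=float)

        rng = np.random.default_rng(0)
        image = rng.random((128, 128))                  # placeholder for an x-ray image
        X = blocks_from_image(image)                    # n = 256 inputs per block
        p = 32                                          # bottleneck width, p < n

        autoenc = MLPRegressor(hidden_layer_sizes=(p,), max_iter=500, random_state=0)
        autoenc.fit(X, X)                               # learn to reproduce each block
        errors = np.mean((autoenc.predict(X) - X) ** 2, axis=1)

        keep = errors > np.percentile(errors, 50)       # keep only hard-to-encode blocks
        relevant_blocks = X[keep]                       # these would go on to LBP/SVM features
        print(f"kept {keep.sum()} of {len(X)} blocks")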

    Spatial Filtering Pipeline Evaluation of Cortically Coupled Computer Vision System for Rapid Serial Visual Presentation

    Rapid Serial Visual Presentation (RSVP) is a paradigm that supports the application of cortically coupled computer vision to rapid image search. In RSVP, images are presented to participants in a rapid serial sequence which can evoke Event-Related Potentials (ERPs) detectable in their Electroencephalogram (EEG). The contemporary approach to this problem involves supervised spatial filtering techniques, which are applied to enhance the discriminative information in the EEG data. In this paper we make two primary contributions to that field: 1) we propose a novel spatial filtering method which we call the Multiple Time Window LDA Beamformer (MTWLB) method; 2) we provide a comprehensive comparison of nine spatial filtering pipelines built from three spatial filtering schemes, namely MTWLB, xDAWN and Common Spatial Patterns (CSP), and three linear classification methods, namely Linear Discriminant Analysis (LDA), Bayesian Linear Regression (BLR) and Logistic Regression (LR). Three pipelines without spatial filtering are used as a baseline comparison. The Area Under the Curve (AUC) is used as the evaluation metric in this paper. The results reveal that the MTWLB and xDAWN spatial filtering techniques enhance the classification performance of the pipeline, but CSP does not. The results also support the conclusion that LR can be effective for RSVP-based BCI if discriminative features are available.
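    One such spatial-filtering-plus-linear-classifier pipeline, scored by AUC, can be sketched with MNE-Python and scikit-learn. The sketch below uses CSP + LDA only to show the pipeline structure; it is not the paper's MTWLB method, and the use of MNE, the placeholder epoch array, and the fold count are assumptions.

        # Minimal sketch of one spatial-filtering pipeline (CSP + LDA) evaluated with
        # AUC; X is an (n_epochs, n_channels, n_times) array of EEG epochs and y holds
        # the target / non-target labels.
        import numpy as np
        from mne.decoding import CSP
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.model_selection import StratifiedKFold, cross_val_score
        from sklearn.pipeline import Pipeline

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 32, 128))    # placeholder epochs: 100 trials, 32 channels
        y = rng.integers(0, 2, size=100)       # placeholder target / non-target labels

        pipe = Pipeline([
            ("csp", CSP(n_components=4, log=True)),          # supervised spatial filtering
            ("lda", LinearDiscriminantAnalysis()),           # linear classification
        ])
        auc = cross_val_score(pipe, X, y, cv=StratifiedKFold(n_splits=5), scoring="roc_auc")
        print(f"mean AUC: {auc.mean():.3f}")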