1,377 research outputs found

    Fraud detection for online banking for scalable and distributed data

    Get PDF
    Online fraud causes billions of dollars in losses for banks. Therefore, online banking fraud detection is an important field of study. However, there are many challenges in conducting research in fraud detection. One of the constraints is due to unavailability of bank datasets for research or the required characteristics of the attributes of the data are not available. Numeric data usually provides better performance for machine learning algorithms. Most transaction data however have categorical, or nominal features as well. Moreover, some platforms such as Apache Spark only recognizes numeric data. So, there is a need to use techniques e.g. One-hot encoding (OHE) to transform categorical features to numerical features, however OHE has challenges including the sparseness of transformed data and that the distinct values of an attribute are not always known in advance. Efficient feature engineering can improve the algorithm’s performance but usually requires detailed domain knowledge to identify correct features. Techniques like Ripple Down Rules (RDR) are suitable for fraud detection because of their low maintenance and incremental learning features. However, high classification accuracy on mixed datasets, especially for scalable data is challenging. Evaluation of RDR on distributed platforms is also challenging as it is not available on these platforms. The thesis proposes the following solutions to these challenges: • We developed a technique Highly Correlated Rule Based Uniformly Distribution (HCRUD) to generate highly correlated rule-based uniformly-distributed synthetic data. • We developed a technique One-hot Encoded Extended Compact (OHE-EC) to transform categorical features to numeric features by compacting sparse-data even if all distinct values are unknown. • We developed a technique Feature Engineering and Compact Unified Expressions (FECUE) to improve model efficiency through feature engineering where the domain of the data is not known in advance. • A Unified Expression RDR fraud deduction technique (UE-RDR) for Big data has been proposed and evaluated on the Spark platform. Empirical tests were executed on multi-node Hadoop cluster using well-known classifiers on bank data, synthetic bank datasets and publicly available datasets from UCI repository. These evaluations demonstrated substantial improvements in terms of classification accuracy, ruleset compactness and execution speed.Doctor of Philosoph

    Interim Design Report

    Get PDF
    The International Design Study for the Neutrino Factory (the IDS-NF) was established by the community at the ninth "International Workshop on Neutrino Factories, super-beams, and beta- beams" which was held in Okayama in August 2007. The IDS-NF mandate is to deliver the Reference Design Report (RDR) for the facility on the timescale of 2012/13. In addition, the mandate for the study [3] requires an Interim Design Report to be delivered midway through the project as a step on the way to the RDR. This document, the IDR, has two functions: it marks the point in the IDS-NF at which the emphasis turns to the engineering studies required to deliver the RDR and it documents baseline concepts for the accelerator complex, the neutrino detectors, and the instrumentation systems. The IDS-NF is, in essence, a site-independent study. Example sites, CERN, FNAL, and RAL, have been identified to allow site-specific issues to be addressed in the cost analysis that will be presented in the RDR. The choice of example sites should not be interpreted as implying a preferred choice of site for the facility

    Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions

    Get PDF
    Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed—either explicitly or implicitly—to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, robustness, and/or speed. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast to O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multiprocessor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data

    Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions

    Get PDF
    Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed---either explicitly or implicitly---to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis

    Case based design of knitwear

    Get PDF
    In the developed world we are surrounded by man-made objects, but most people give little thought to the complex processes needed for their design. The design of hand knitting is complex because much of the domain knowledge is tacit. The objective of this thesis is to devise a methodology to help designers to work within design constraints, whilst facilitating creativity. A hybrid solution including computer aided design (CAD) and case based reasoning (CBR) is proposed. The CAD system creates designs using domain-specific rules and these designs are employed for initial seeding of the case base and the management of constraints. CBR reuses the designer's previous experience. The key aspects in the CBR system are measuring the similarity of cases and adapting past solutions to the current problem. Similarity is measured by asking the user to rank the importance of features; the ranks are then used to calculate weights for an algorithm which compares the specifications of designs. A novel adaptation operator called rule difference replay (RDR) is created. When the specifications to a new design is presented, the CAD program uses it to construct a design constituting an approximate solution. The most similar design from the case-base is then retrieved and RDR replays the changes previously made to the retrieved design on the new solution. A measure of solution similarity that can validate subjective success scores is created. Specification similarity can be used as a guide whether to invoke CBR, in a hybrid CAD-CBR system. If the newly resulted design is suffciently similar to a previous design, then CBR is invoked; otherwise CAD is used. The application of RDR to knitwear design has demonstrated the flexibility to overcome deficiencies in rules that try to automate creativity, and has the potential to be applied to other domains such as interior design

    An Interpretable Multiple-Instance Approach for the Detection of referable Diabetic Retinopathy from Fundus Images

    Full text link
    Diabetic Retinopathy (DR) is a leading cause of vision loss globally. Yet despite its prevalence, the majority of affected people lack access to the specialized ophthalmologists and equipment required for assessing their condition. This can lead to delays in the start of treatment, thereby lowering their chances for a successful outcome. Machine learning systems that automatically detect the disease in eye fundus images have been proposed as a means of facilitating access to DR severity estimates for patients in remote regions or even for complementing the human expert's diagnosis. In this paper, we propose a machine learning system for the detection of referable DR in fundus images that is based on the paradigm of multiple-instance learning. By extracting local information from image patches and combining it efficiently through an attention mechanism, our system is able to achieve high classification accuracy. Moreover, it can highlight potential image regions where DR manifests through its characteristic lesions. We evaluate our approach on publicly available retinal image datasets, in which it exhibits near state-of-the-art performance, while also producing interpretable visualizations of its predictions.Comment: 11 page
    corecore