
    Rank Based Anomaly Detection Algorithms

    Anomaly or outlier detection problems are of considerable importance, arising frequently in diverse real-world applications such as finance and cyber-security. Several algorithms have been formulated for such problems, usually based on a problem-dependent heuristic or distance metric. This dissertation proposes anomaly detection algorithms that exploit the notion of "rank," which expresses the relative outlierness of different points in the relevant space, and that exploit asymmetry in nearest-neighbor relations between points: a data point is "more anomalous" if it is not the nearest neighbor of its own nearest neighbors. Although rank is computed using distance, it is a more robust, higher-level abstraction that is particularly helpful in problems characterized by significant variations in data point density, where distance alone is inadequate. We begin by proposing a rank-based outlier detection algorithm, and then discuss how it may be extended by also considering clustering-based approaches. We show that the use of rank significantly improves anomaly detection performance in a broad range of problems. We then consider the problem of identifying the most anomalous among a set of time series, e.g., the stock price of a company that exhibits significantly different behavior from its peer group of other companies. In such problems, different characteristics of time series are captured by different metrics, and we show that the best performance is obtained by combining several such metrics, along with the use of rank-based algorithms for anomaly detection. In practical scenarios, it is of interest to identify when a time series begins to diverge from the behavior of its peer group. We address this problem as well, using an online version of the anomaly detection algorithm developed earlier. Finally, we address the task of detecting anomalous sub-sequences within a single time series. This is accomplished by refining the multiple-distance combination approach, which succeeds when other algorithms (based on a single distance measure) fail. The algorithms developed in this dissertation can be applied in a large variety of application areas and can assist in solving many practical problems.
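
    The reverse-nearest-neighbor intuition can be illustrated with a short sketch. This is not the dissertation's exact algorithm, only a minimal rank-based score under assumed inputs (a small dataset X and neighborhood size k): each point is scored by how far back it sits in the ordered neighbor lists of its own k nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rank_based_outlier_scores(X, k=5):
    """Score each point by how far back it appears in the ordered neighbor
    lists of its own k nearest neighbors (higher mean rank = more anomalous)."""
    n = len(X)
    nbrs = NearestNeighbors(n_neighbors=n).fit(X)
    _, order = nbrs.kneighbors(X)              # order[i]: all points sorted by distance from i
    rank_of = np.empty((n, n), dtype=int)
    for i in range(n):
        rank_of[i, order[i]] = np.arange(n)    # rank_of[i, j]: rank of j as seen from i
    scores = np.zeros(n)
    for i in range(n):
        neighbors = order[i, 1:k + 1]          # skip rank 0, which is the point itself
        scores[i] = rank_of[neighbors, i].mean()
    return scores

# Tiny example: two clusters of different density plus one isolated point.
X = np.vstack([np.random.RandomState(0).normal(0, 1, (20, 2)),
               np.random.RandomState(1).normal(8, 1, (20, 2)),
               [[4.0, 12.0]]])
print(rank_based_outlier_scores(X, k=5).argmax())   # the isolated point (last row) scores highest
```

    An inlier in a dense cluster ranks near the front of its neighbors' lists, so its score stays small even when cluster densities differ, which is the situation the abstract identifies as hard for purely distance-based scores.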

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and an uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning (CSL) method to deal with the classification of imperfect data. Most traditional approaches to classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the performance measures best suited to imperfect data and address real problems in quality control and business analytics.
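
    As a rough illustration of combining cost-sensitive learning with an SVM (the abstract names the combination but not the cost assignment), scikit-learn's class_weight parameter can impose a higher penalty on minority-class errors. The synthetic data and the 1:10 cost ratio below are placeholders for illustration, not values from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic imbalanced data standing in for the paper's datasets (95% / 5% split).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Misclassifying the rare class (label 1) costs 10x more; the ratio is illustrative only.
clf = SVC(kernel="rbf", class_weight={0: 1, 1: 10})
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```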

    Automated Cleaning of Identity Label Noise in A Large-scale Face Dataset Using A Face Image Quality Control

    For face recognition, several very large-scale datasets have become publicly available in recent years. They are usually collected from the internet using search engines, and thus contain many faces with wrong identity labels (outliers). Additionally, the face images in these datasets vary in quality. Since low-quality face images are hard to identify, current automated identity label cleaning methods are unable to detect identity label errors in low-quality faces. We therefore propose a novel approach that cleans identity label errors in more of the low-quality faces. Face identity labels cleaned by our method can train better models for low-quality face recognition. The problem of low-quality face recognition is very common in real-life scenarios, where face images are usually captured by surveillance cameras in unconstrained conditions.

    Our proposed method starts by defining, for each identity, a clean subset consisting of the top high-quality face images and the top search-ranked faces carrying that identity label. We call this set the "identity reference set". After that, a "quality adaptive similarity threshold" is applied to decide whether a face image from the original identity set is similar to the identity reference set (inlier) or not. The quality adaptive similarity threshold means using adaptive threshold values for faces based on their quality scores. Because inlier low-quality faces contain less facial information and are likely to achieve a lower similarity score to the identity reference than high-quality inlier faces, using a less strict threshold for low-quality faces saves them from being falsely classified as outliers.

    In our low-to-high-quality face verification experiments, the deep model trained on our cleaned version of MS-Celeb-1M.v1 outperforms the same model trained on MS-Celeb-1M.v1 cleaned by the semantic bootstrapping method. We also apply our identity label cleaning method to a subset of the CACD face dataset, where our quality-based cleaning delivers higher precision and recall than a previous method.
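
    A minimal sketch of the quality-adaptive threshold idea follows. The cosine-similarity comparison, the mean reference embedding, and the specific base_threshold and relaxation values are assumptions for illustration, not parameters reported by the paper.

```python
import numpy as np

def is_inlier(face_embedding, reference_embeddings, quality_score,
              base_threshold=0.6, relaxation=0.15):
    """Keep a face under an identity if its similarity to that identity's
    reference set exceeds a threshold that is relaxed for low-quality faces.
    quality_score is assumed to lie in [0, 1]; threshold values are illustrative."""
    ref = np.asarray(reference_embeddings).mean(axis=0)   # prototype of the identity reference set
    face = np.asarray(face_embedding)
    sim = face @ ref / (np.linalg.norm(face) * np.linalg.norm(ref))
    # Lower quality lowers the similarity required to remain an inlier.
    threshold = base_threshold - relaxation * (1.0 - quality_score)
    return sim >= threshold
```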

    Suspicious activity reporting using dynamic Bayesian networks

    Suspicious activity reporting has been a crucial part of anti-money-laundering systems. Financial transactions are considered suspicious when they deviate from the customer's regular behavior. Money launderers pay special attention to keeping their transactions as normal as possible to disguise their illicit nature, which may deceive classical deviation-based statistical methods for finding anomalies. This study presents an approach, called SARDBN (Suspicious Activity Reporting using Dynamic Bayesian Network), that employs a combination of clustering and a dynamic Bayesian network (DBN) to identify anomalies in sequences of transactions. SARDBN applies the DBN to capture patterns in a customer's monthly transactional sequences and to compute an anomaly index called AIRE (Anomaly Index using Rank and Entropy). AIRE measures the degree of anomaly in a transaction and is compared against a pre-defined threshold to mark the transaction as normal or suspicious. The presented approach is tested on a real dataset of more than 8 million banking transactions and has shown promising results.
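
    The abstract names the ingredients of AIRE (rank and entropy over the DBN's predictions) but not its formula, so the following is only a toy index built from those two ingredients: the rank of the observed transaction bin under the model's predicted distribution, down-weighted when the prediction itself is high-entropy. The threshold and bin probabilities are illustrative.

```python
import numpy as np

def anomaly_index(predicted_probs, observed_bin, threshold=0.7):
    """Toy rank-and-entropy style index: how unexpected the observed transaction
    bin is under the model's predicted distribution, scaled by how confident
    (low-entropy) that distribution is. Not the paper's exact AIRE formula."""
    p = np.asarray(predicted_probs, dtype=float)
    p = p / p.sum()
    order = np.argsort(-p)                                  # bins from most to least expected
    rank = np.where(order == observed_bin)[0][0] / (len(p) - 1)   # 0 = expected, 1 = least expected
    entropy = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))     # 0 = confident, 1 = uniform
    index = rank * (1.0 - entropy)                          # surprising outcome + confident model -> high
    return index, index > threshold

# A confident model sees its least expected transaction bin: index ~0.88, flagged as suspicious.
print(anomaly_index([0.97, 0.02, 0.007, 0.003], observed_bin=3))
```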

    The Robust Estimation of Monthly Prices of Goods Traded by the European Union

    The general problem addressed in this document is the estimation of “fair” import prices from international trade data. The work supports the determination of the customs value at the moment of the customs formalities, to establish how much duty the importer must pay, as well as post-clearance checks of individual transactions. The proposed approach can be naturally extended to the analysis of export flows and used for other purposes, including general market analyses. The Joint Research Centre of the European Commission has previously addressed the trade price estimation problem (Arsenis et al., 2015) by considering data for a fixed product, origin, and destination over a multiannual time period, typically of 3 or 4 years, leading to price estimates that are specific to each EU Member State. This report illustrates a different model whereby each price estimate is calculated on a monthly basis, using data for a fixed time (month), product, and origin. The approach differentiates between trades originating from different third countries and is therefore particularly useful for monitoring trends and anomalies in specific EU trade markets. These Estimated European Monthly Prices are published every month by the Joint Research Centre in a dedicated section of the THESEUS website (https://theseus.jrc.ec.europa.eu), accessible to authorized users of the EU and Member State services. The section, called Monthly Fair Prices, also shows the time evolution of worldwide price estimates computed with the same approach by fixing only time and product.
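
    As a rough sketch of the monthly grouping described above: the report's actual robust estimator and data schema are not given in this summary, so the median unit value and the column names and sample figures below are assumptions used only to mirror the fixed (month, product, origin) grouping.

```python
import pandas as pd

# One row per import transaction; values and the CN product code are illustrative only.
trades = pd.DataFrame({
    "month": ["2023-01", "2023-01", "2023-01"],
    "product": ["080390", "080390", "080390"],
    "origin": ["EC", "EC", "EC"],
    "value_eur": [12000.0, 11500.0, 45000.0],
    "quantity_kg": [20000.0, 19000.0, 21000.0],
})
trades["unit_price"] = trades["value_eur"] / trades["quantity_kg"]

# Median unit value per (month, product, origin): a simple robust estimate that
# resists the outlying third transaction; the JRC method may differ in detail.
monthly_prices = (trades.groupby(["month", "product", "origin"])["unit_price"]
                        .median()
                        .rename("estimated_price_eur_per_kg"))
print(monthly_prices)
```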