64 research outputs found

    Anomaly Detection with Variance Stabilized Density Estimation

    Density estimation based anomaly detection schemes typically model anomalies as examples that reside in low-density regions. We propose a modified density estimation problem and demonstrate its effectiveness for anomaly detection. Specifically, we assume the density function of normal samples is uniform in some compact domain. This assumption implies that the density function is more stable (has lower variance) around normal samples than around anomalies. We first corroborate this assumption empirically on a wide range of real-world data. Then, we design a variance stabilized density estimation problem that maximizes the likelihood of the observed samples while minimizing the variance of the density around normal samples. We introduce an ensemble of autoregressive models to learn the variance stabilized distribution. Finally, we perform an extensive benchmark on 52 datasets, demonstrating that our method achieves state-of-the-art results while alleviating the need for data-specific hyperparameter tuning. Comment: 12 pages, 6 figures
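
    A minimal sketch of the variance-stabilized objective this abstract describes, assuming a simple one-dimensional Gaussian density model in place of the paper's ensemble of autoregressive models; the penalty weight `lam` and the toy data are illustrative choices, not values from the paper.

```python
# Sketch: maximize likelihood while minimizing the variance of the
# log-density around the observed normal samples.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=500)  # "normal" samples

def loss(params, x, lam=1.0):
    mu, log_sigma = params
    log_p = norm.logpdf(x, loc=mu, scale=np.exp(log_sigma))
    # Negative mean log-likelihood plus a variance-stabilization penalty.
    return -log_p.mean() + lam * log_p.var()

res = minimize(loss, x0=np.array([0.5, 0.5]), args=(normal_data,))
mu, sigma = res.x[0], np.exp(res.x[1])

def anomaly_score(x):
    # Low density under the stabilized model => high anomaly score.
    return -norm.logpdf(x, loc=mu, scale=sigma)

print(anomaly_score(np.array([0.1, 6.0])))  # inlier vs. outlier scores
```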

    T2D2: A Time Series Tester, Transformer, and Decomposer Framework for Outlier Detection

    The automatic detection of outliers in time series datasets has attracted much attention in the data science community. It is not a simple task, as the data may have several characteristics, such as seasonality, trend, or a combination of the two. Furthermore, to obtain reliable and trustworthy knowledge from the data, the data itself should be understandable. To cope with these challenges, in this paper we introduce a new framework that first tests the stationarity and seasonality of the dataset and then applies a set of Fourier transforms to obtain the Fourier sample frequencies, which support a decomposer component. The proposed framework, namely TTDD (Test, Transform, Decompose, and Detection), implements a decomposer component that splits the dataset into three parts: trend, seasonal, and residual. Finally, a frequency difference detector compares the frequency of the test set to that of the training set, identifying the periods where the frequencies diverge and marking them as outlier periods.
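
    A minimal sketch of a test/transform/decompose/detect pipeline in the spirit of this abstract, assuming a pandas Series input; the exact mechanics of the paper's frequency difference detector are not given, so the period, train/test split, and tolerance below are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose

def dominant_frequency(x):
    # Fourier sample frequencies; pick the strongest non-DC component.
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x))
    return freqs[1:][np.argmax(spectrum[1:])]

def ttdd(series: pd.Series, period: int = 24, tol: float = 0.25):
    # 1) Test: stationarity via the augmented Dickey-Fuller test.
    p_value = adfuller(series.dropna())[1]
    stationary = p_value < 0.05
    # 2/3) Transform + Decompose: trend / seasonal / residual parts.
    parts = seasonal_decompose(series, period=period)
    residual = parts.resid.dropna()
    # 4) Detect: compare dominant frequencies of train vs. test windows.
    split = int(len(residual) * 0.7)
    train, test = residual.iloc[:split], residual.iloc[split:]
    f_train = dominant_frequency(train.values)
    f_test = dominant_frequency(test.values)
    is_outlier_period = abs(f_test - f_train) > tol * f_train
    return stationary, is_outlier_period
```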

    Unsupervised anomaly detection of retail stores using the Predictive Analysis Library on SAP HANA XS Advanced

    The retail industry is quite exposed to fraud. Daily, thousands of transactions are processed, some of which may involve frauds that are difficult to detect, especially when the perpetrators are the retail stores' own employees. Large retailers with several stores across different locations may have considerable difficulty detecting frauds involving their cashiers, since they have to take into account different contexts of operation. To reduce fraud losses, retailers get an overview of the transactions in each store to filter out the ones that look suspicious because they deviate from what would be normal. Data mining algorithms can be useful for detecting anomalies, differentiating the normal from the abnormal. This study adopted the k-Means clustering algorithm for anomaly detection on a sample of 90 stores in a large food retail chain, revealing the existence of some outliers in the data. The anomaly detection process was fully implemented in SAP HANA XS Advanced using the Predictive Analysis Library (PAL). In the end, it was possible to identify the stores with abnormal behavior and to conclude that the library is useful and easy to use, despite some lack of documentation. This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020.
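
    A minimal sketch of k-Means-based anomaly detection analogous to the approach above, written with scikit-learn rather than SAP HANA's PAL (whose API is not shown in the abstract); the per-store feature matrix, the number of clusters, and the 95th-percentile cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical per-store features, e.g. transaction counts and refund rates.
stores = rng.normal(size=(90, 4))

X = StandardScaler().fit_transform(stores)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Distance of each store to its assigned centroid serves as the anomaly score.
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = np.percentile(distances, 95)
anomalous_stores = np.flatnonzero(distances > threshold)
print("Stores with abnormal behavior:", anomalous_stores)
```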

    Deep Learning-Based Approach for Missing Data Imputation

    Missing values in datasets are a problem that decreases machine learning performance. New methods are recommended every day to overcome this problem, among them statistical, machine learning, evolutionary, and deep learning methods. Although deep learning is one of today's popular subjects, there are limited studies on using it for missing data imputation. Several deep learning techniques have been used to handle missing data; one of them is the autoencoder and its denoising and stacked variants. In this study, the missing values in three different real-world datasets were estimated using the denoising autoencoder (DAE), k-nearest neighbor (kNN), and multivariate imputation by chained equations (MICE) methods. The estimation success of the methods was compared according to the root mean square error (RMSE) criterion. It was observed that the DAE method was more successful than the other methods in estimating the missing values for large datasets.
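
    A minimal sketch of the kNN-versus-MICE comparison by RMSE described above; the DAE baseline is omitted to keep the example dependency-free beyond scikit-learn, and the synthetic data and 10% masking rate are illustrative assumptions, not the study's datasets.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(1000, 8))

# Mask 10% of entries at random to simulate missing values.
mask = rng.random(X_true.shape) < 0.10
X_missing = np.where(mask, np.nan, X_true)

def rmse(X_imputed):
    # RMSE computed only over the entries that were actually masked.
    return np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))

for name, imputer in [("kNN", KNNImputer(n_neighbors=5)),
                      ("MICE", IterativeImputer(random_state=0))]:
    print(name, "RMSE:", rmse(imputer.fit_transform(X_missing)))
```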

    Outlier Mining Methods Based on Graph Structure Analysis

    Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient only for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph whose nodes are the elements of the dataset and whose links have associated weights equal to the distances between the nodes. The first method then assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap non-linear dimensionality reduction algorithm and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; the IsoMap method, on the other hand, has two integer parameters, and when they are appropriately selected, it performs similarly to or better than all the other methods tested.
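
    A minimal sketch of the second (IsoMap-based) scoring idea described above, assuming scikit-learn's Isomap as the reducer; `n_neighbors` and `n_components` stand in for the two integer parameters the abstract mentions, and the synthetic data is an illustrative assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 10)),           # inliers
               rng.normal(loc=6.0, size=(5, 10))])   # a few outliers

iso = Isomap(n_neighbors=10, n_components=2).fit(X)
geodesic = iso.dist_matrix_                      # graph geodesic distances
embedded = squareform(pdist(iso.embedding_))     # distances in reduced space

# Points whose geodesic distances are poorly preserved by the embedding
# receive high outlier scores.
scores = np.abs(geodesic - embedded).mean(axis=1)
print("Top outlier indices:", np.argsort(scores)[-5:])
```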