Anomaly Detection with Variance Stabilized Density Estimation
Density estimation based anomaly detection schemes typically model anomalies
as examples that reside in low-density regions. We propose a modified density
estimation problem and demonstrate its effectiveness for anomaly detection.
Specifically, we assume the density function of normal samples is uniform in
some compact domain. This assumption implies the density function is more
stable (with lower variance) around normal samples than anomalies. We first
corroborate this assumption empirically using a wide range of real-world data.
Then, we design a variance stabilized density estimation problem for maximizing
the likelihood of the observed samples while minimizing the variance of the
density around normal samples. We introduce an ensemble of autoregressive
models to learn the variance stabilized distribution. Finally, we perform an
extensive benchmark with 52 datasets demonstrating that our method leads to
state-of-the-art results while alleviating the need for data-specific
hyperparameter tuning.
Comment: 12 pages, 6 figures
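The objective described in this abstract, maximizing the likelihood of observed samples while minimizing the variance of the density around them, can be sketched as a penalized negative log-likelihood. The sketch below is illustrative only: the function name and the `lam` weight are assumptions, and the paper itself trains an ensemble of autoregressive models rather than operating on precomputed log-densities.

```python
import numpy as np

def variance_stabilized_loss(log_densities, lam=1.0):
    """Illustrative combined objective: negative log-likelihood plus a
    penalty on the variance of the log-density across normal samples."""
    nll = -np.mean(log_densities)        # maximize likelihood
    var_penalty = np.var(log_densities)  # stabilize density around normals
    return nll + lam * var_penalty

# Flat log-densities incur no variance penalty; spread-out log-densities
# with the same mean (hence the same NLL) are penalized.
flat = np.array([-1.0, -1.0, -1.0, -1.0])
spread = np.array([-0.5, -1.5, -0.5, -1.5])
assert variance_stabilized_loss(flat) < variance_stabilized_loss(spread)
```

The penalty pushes the learned density toward the uniform-on-a-compact-domain assumption: normal samples should all receive similar density values, so anomalies stand out as low-density points.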
T2D2: A Time Series Tester, Transformer, and Decomposer Framework for Outlier Detection
The automatic detection of outliers in time series datasets has attracted much attention in the data science community. It is not a simple task, as the data may exhibit several behaviours, such as seasonal patterns, trends, or a combination of the two. Furthermore, to obtain reliable and trustworthy knowledge from the data, the data itself should be understandable. To cope with these challenges, in this paper we introduce a new framework that first tests the stationarity and seasonality of the dataset, then applies a set of Fourier transforms to obtain the Fourier sample frequencies, which support a decomposer component. The proposed framework, namely TTDD (Test, Transform, Decompose, and Detection), implements a decomposer component that splits the dataset into three parts: trend, seasonal, and residual. Finally, a frequency difference detector compares the frequency of the test set to the frequency of the training set, determining the periods where the frequencies disagree and flagging them as outlier periods.
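The transform and decompose steps above can be sketched with NumPy's FFT. This is an illustrative reading of the pipeline, not the paper's code: the function names, the moving-average trend estimate, and the periodic-mean seasonal estimate are all assumptions.

```python
import numpy as np

def dominant_frequency(x):
    """Strongest nonzero Fourier sample frequency of series x
    (a stand-in for the framework's transform step)."""
    spec = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x))
    return freqs[np.argmax(spec[1:]) + 1]  # skip the DC component

def decompose(x, period):
    """Split x into trend (moving average), seasonal (periodic means),
    and residual parts, as in the decomposer component."""
    kernel = np.ones(period) / period
    trend = np.convolve(x, kernel, mode="same")
    detrended = x - trend
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(seasonal, len(x) // period + 1)[: len(x)]
    residual = x - trend - seasonal
    return trend, seasonal, residual

# A clean periodic training series vs. a test window at a different frequency:
# the frequency difference detector would flag the discrepancy.
t = np.arange(256)
train = np.sin(2 * np.pi * t / 16)
test = np.sin(2 * np.pi * t / 8)
trend, seasonal, residual = decompose(train, period=16)
assert dominant_frequency(train) != dominant_frequency(test)
```

A frequency difference detector in this spirit would compare `dominant_frequency` (or the full spectrum) of the training and test windows and report the windows where they diverge.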
Unsupervised anomaly detection of retail stores using Predictive Analysis Library on SAP HANA XS Advanced
The retail industry is quite exposed to fraudulent situations. Daily, thousands of transactions are processed, which may include frauds that are difficult to detect, mainly when the perpetrators are the employees of the retail stores themselves. Large retailers with several stores across different locations may have considerable difficulty in detecting frauds involving their cashiers, since they have to take into account different contexts of operation. To reduce fraud losses, retailers need an overview of the transactions in each store so they can filter out the ones that look suspicious, i.e., that deviate from what would be normal. Data mining algorithms can be useful to detect anomalies, differentiating the normal from the abnormal. This study adopted the k-Means clustering algorithm for anomaly detection on a sample of 90 stores in a large food retail chain, revealing the existence of some outliers in the data. The anomaly detection process was fully implemented in SAP HANA XS Advanced using the Predictive Analysis Library (PAL). In the end, it was possible to identify the stores with abnormal behavior and to conclude on the usefulness and ease of use of such a library, despite some lack of documentation. This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020.
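The k-Means anomaly-detection idea used in the study can be sketched as follows: cluster the store feature vectors, then score each store by its distance to the nearest centroid. This is a plain NumPy stand-in for the PAL call, not the study's implementation; the data, initialization, and feature layout are invented for illustration.

```python
import numpy as np

def kmeans(X, init, iters=50):
    """Minimal Lloyd-style k-Means (illustrative stand-in for PAL's
    k-Means; `init` holds the starting centroids)."""
    centroids = np.asarray(init, dtype=float).copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def anomaly_scores(X, centroids):
    """Distance to the nearest centroid: large values flag stores whose
    behaviour deviates from every cluster of 'normal' stores."""
    return np.min(np.linalg.norm(X[:, None] - centroids, axis=-1), axis=1)

# Toy data: two tight clusters of 'stores' plus one clear outlier (last row)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (45, 2)),
               rng.normal(5, 0.1, (45, 2)),
               [[20.0, 20.0]]])
centroids = kmeans(X, init=X[[0, 45]])
scores = anomaly_scores(X, centroids)
assert np.argmax(scores) == len(X) - 1  # the outlier gets the highest score
```

Ranking stores by this score gives the shortlist of suspicious stores that the study filters for manual review.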
Deep Learning-Based Approach for Missing Data Imputation
Missing values in datasets are a problem that decreases machine learning performance, and new methods to overcome it are recommended every day. Statistical, machine learning, evolutionary, and deep learning methods are among them. Although deep learning is one of the popular subjects of today, there are limited studies on its use for missing data imputation. Several deep learning techniques have been used to handle missing data; one of them is the autoencoder and its denoising and stacked variants. In this study, the missing values in three different real-world datasets were estimated using the denoising autoencoder (DAE), k-nearest neighbor (kNN), and multivariate imputation by chained equations (MICE) methods. The estimation success of the methods was compared according to the root mean square error (RMSE) criterion. It was observed that the DAE method was more successful than the other statistical methods in estimating the missing values for large datasets.
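The evaluation protocol described here, hiding known values, imputing them, and scoring by RMSE, can be sketched with two simple imputers. These are illustrative stand-ins only: the study's actual comparison uses DAE, kNN, and MICE, whereas this toy uses a column-mean baseline and a minimal kNN imputer, and all names and data are invented.

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def mean_impute(X):
    """Fill each missing entry with its column mean (simple baseline)."""
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = col_means[cols]
    return filled

def knn_impute(X, k=3):
    """Fill missing entries from the k nearest rows of a provisionally
    mean-filled matrix (toy stand-in for the study's kNN imputer)."""
    filled = mean_impute(X)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        d = np.linalg.norm(filled - filled[i], axis=1)
        neighbours = np.argsort(d)[1 : k + 1]  # skip the row itself
        out[i, j] = filled[neighbours, j].mean()
    return out

# Protocol: hide known values, impute, score by RMSE against the truth
rng = np.random.default_rng(0)
true = rng.normal(size=(50, 4))
true[:, 1] = true[:, 0] * 2            # correlated columns help kNN
X = true.copy()
mask = rng.random(true.shape) < 0.1    # hide ~10% of entries
X[mask] = np.nan
err_mean = rmse(true[mask], mean_impute(X)[mask])
err_knn = rmse(true[mask], knn_impute(X)[mask])
```

The study applies the same masked-RMSE protocol to compare DAE against kNN and MICE on the three real-world datasets.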
Outlier Mining Methods Based on Graph Structure Analysis
Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient only for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph, where the nodes are the elements of the dataset, and the links have associated weights that are the distances between the nodes. Then, the first method assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap non-linear dimensionality reduction algorithm, and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform, or perform on par with, other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; on the other hand, the IsoMap method has two integer parameters, and when they are appropriately selected, the method performs similarly to or better than all the other methods tested.
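One illustrative reading of the percolation idea (the IsoMap variant is omitted here) is: build the complete distance graph, remove edges from longest to shortest, and score each node by how early it becomes isolated, since an outlier's edges are all long and disappear first. The function name, scoring formula, and data below are assumptions, not the paper's exact definition.

```python
import numpy as np

def percolation_scores(X):
    """Remove edges of the complete distance graph from longest to
    shortest; the earlier a node becomes isolated, the higher its
    outlier score (illustrative sketch of the percolation method)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    edges = [(D[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
    edges.sort(reverse=True)          # longest edges are removed first
    degree = np.full(n, n - 1)
    scores = np.zeros(n)
    for step, (_, i, j) in enumerate(edges):
        for node in (i, j):
            degree[node] -= 1
            if degree[node] == 0:     # node just became isolated
                scores[node] = 1 - step / len(edges)
    return scores

# A tight cluster plus one far-away point: the outlier isolates first
pts = np.vstack([np.random.default_rng(2).normal(0, 0.2, (20, 2)),
                 [[10.0, 10.0]]])
scores = percolation_scores(pts)
assert np.argmax(scores) == len(pts) - 1
```

Note that, as in the paper, this scoring needs no parameters or training: only the pairwise distance matrix is used.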