    On the improvement of complexity time and detection rate of outlier detectors : an unsupervised ensemble perspective

    This thesis presents two unsupervised algorithms to detect outlier observations whose aberrant behavior is hidden in lower-dimensional subspaces or cannot be identified with a single detector. In particular, we consider three facets: first, the difficulty a single detector has in identifying different types of outliers; second, the propensity of interesting outliers to hide in low-dimensional subspaces; third, the impact that distinct distance measures have on the outlier detection process. The proposed algorithms aim to improve our understanding of observations whose outlier behavior is not evident to simple outlier detection algorithms. Accordingly, we address three specific problems. First, we propose an ensemble based on different types of outlier detectors, with a set of weights assigned without supervision. Second, we propose an ensemble to identify observations whose outlier behavior is visible only in specific subspaces. Third, we develop a scheme to understand how a single detector or an ensemble of outlier detectors is influenced by the choice of distance metric and its interaction with different dimensionalities, data sizes, parameter settings and ensemble components. Many algorithms are available for detecting outliers. However, unsupervised ensemble approaches are few and are mainly oriented towards detecting a specific type of outlier. Accordingly, our first goal is to detect, in an unsupervised manner, distinct types of outlying observations.
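To make the first problem concrete, the following is a minimal sketch of an ensemble that combines scores from different detectors using weights derived without labels. The weighting rule used here (each detector's correlation with the ensemble's mean score) is an assumption chosen for illustration, not the scheme proposed in the thesis, and the function names `zscore`, `pearson` and `weighted_ensemble` are hypothetical. The per-observation second weight mentioned in the abstract is not modeled.

```python
import math
import statistics


def zscore(scores):
    """Standardize one detector's scores so detectors are comparable."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores) or 1.0  # guard against constant scores
    return [(s - mu) / sd for s in scores]


def pearson(a, b):
    """Pearson correlation between two equally long score lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def weighted_ensemble(score_lists):
    """Combine detector scores (higher = more outlying) without labels.

    Assumed weighting rule: a detector that agrees with the consensus
    (mean of standardized scores) receives a larger weight.
    """
    normalized = [zscore(s) for s in score_lists]
    n = len(normalized[0])
    consensus = [statistics.fmean(col) for col in zip(*normalized)]
    weights = [max(pearson(s, consensus), 0.0) for s in normalized]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    combined = [
        sum(w * s[i] for w, s in zip(weights, normalized)) for i in range(n)
    ]
    return combined, weights
```

Given score lists from several detectors over the same observations, `weighted_ensemble` returns one combined score per observation together with the normalized detector weights; a detector whose ranking disagrees with the others contributes little to the final score.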
We propose an approach that uses the output of different types of detectors, assigning a specific weight to each detector based on an internal (unsupervised) evaluation of each algorithm's ability on the dataset at hand; furthermore, this approach assigns a second weight to each data observation in order to increase the gap between outliers and inliers, further improving the outlier detection rate. The main contribution of this work is an ensemble of outlier detectors, whose components can be based on different assumptions, with an enhanced outlier detection rate compared with similar single and ensemble approaches for outlier detection. Nonetheless, the processing time of our approach grows linearly with the number of ensemble components; this behavior is not exclusive to our approach and is instead prevalent in the ensemble outlier detection literature. The second part of this thesis focuses on the detection of a complex type of outlier, known in the literature as interesting outliers, which are detectable only in specific subspaces of the data; in contrast, simple outliers are detectable in the full-dimensional space. Since our first approach was unable to detect this type of outlier efficiently, our second goal is the computationally efficient detection of lower-dimensional outliers. We propose an unsupervised ensemble based on different subspaces and subsamples of the data, which provides a higher detection rate and is computationally more efficient than similar ensemble approaches; in some cases, our approach even outperforms a single execution of a simple outlier detection algorithm. The main contribution of this work is the ability to detect lower-dimensional outliers with an improved processing time.
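The idea of combining random subspaces with random subsamples can be sketched as follows. This is an illustration under stated assumptions, not the thesis algorithm: the base detector (distance to the k-th nearest neighbor in a reference subsample), the rule for choosing subspace sizes, and the names `knn_score` and `subspace_subsample_ensemble` are all hypothetical. Scores are averaged without normalizing across subspaces of different dimensionality, which a real implementation would likely need to address.

```python
import math
import random


def knn_score(point, reference, dims, k):
    """Distance from `point` to its k-th nearest reference point,
    measured only on the selected dimensions `dims`."""
    dists = sorted(
        math.sqrt(sum((point[d] - r[d]) ** 2 for d in dims))
        for r in reference
    )
    return dists[min(k, len(dists)) - 1]


def subspace_subsample_ensemble(data, n_members=20, sample_size=16, k=3, seed=0):
    """Average k-NN distance scores over random subspaces and subsamples.

    Each ensemble member draws a random subset of dimensions and a random
    subsample of the data as reference set, so every observation is compared
    against only `sample_size` points instead of the full dataset.
    """
    rng = random.Random(seed)
    n_dims = len(data[0])
    totals = [0.0] * len(data)
    for _ in range(n_members):
        # Assumed rule: subspace size between 1 and half the dimensions.
        size = rng.randint(1, max(1, n_dims // 2))
        dims = rng.sample(range(n_dims), size)
        reference = rng.sample(data, min(sample_size, len(data)))
        for i, point in enumerate(data):
            totals[i] += knn_score(point, reference, dims, k)
    return [t / n_members for t in totals]
```

Because each member scores observations against a fixed-size subsample, the cost per member grows linearly with the number of observations rather than quadratically, which is one plausible source of the efficiency gain the abstract describes.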
The last section of this thesis studies how distance metric, parameter settings, data size, dimensionality and number of ensemble components interact in determining the detection rate and processing time of an outlier detector. Hence, our third goal is to improve our understanding of the multiple factors influencing an outlier detection algorithm. A set of experiments has been devised to evaluate both detection rate and processing time. The experiments cover a wide set of synthetic and real-world data scenarios. Our synthetic data experiments allow us to introduce perturbations in the size and dimensionality of the data, while real-world data permits an evaluation of the effect of varying the parameter settings of an algorithm. To the best of our knowledge, this is the first evaluation considering a complete set of factors, mainly distance metrics, influencing the effectiveness and efficiency of an outlier detector. The understanding achieved in this study can be a key step towards the development of new ensemble approaches or the selection and parameterization of existing ones.
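As a small illustration of how the choice of distance metric enters an outlier detector, the sketch below computes a k-th nearest neighbor distance score under three common metrics. The detector, the metrics selected, and the names `knn_outlier_scores` and `rank_by_metric` are assumptions for illustration; the abstract does not state which detectors or metrics the thesis evaluates.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def knn_outlier_scores(data, metric, k=2):
    """Score each observation by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(data):
        dists = sorted(metric(p, q) for j, q in enumerate(data) if j != i)
        scores.append(dists[k - 1])
    return scores


def rank_by_metric(data, metrics, k=2):
    """Return, per metric name, observation indices ordered from most to
    least outlying, so rankings can be compared across metrics."""
    rankings = {}
    for name, metric in metrics.items():
        scores = knn_outlier_scores(data, metric, k)
        rankings[name] = sorted(range(len(data)), key=lambda i: -scores[i])
    return rankings
```

Comparing the rankings returned for each metric on the same dataset gives a simple view of how sensitive a detector is to the metric; timing the same calls would give the corresponding view of processing time.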