Uncertain distance-based outlier detection with arbitrarily shaped data objects

Author: Fabio Fassetti
Fabrizio Angiulli
Publication date: 15/10/2020

Abstract: Enabling information systems to face anomalies in the presence of uncertainty is a compelling and challenging task. In this work the problem of unsupervised outlier detection in large collections of data objects modeled by means of arbitrary multidimensional probability density functions is considered. We present a novel definition of uncertain distance-based outlier under the attribute level uncertainty model, according to which an uncertain object is an object that always exists but whose actual value is modeled by a multivariate pdf. According to this definition an uncertain object is declared to be an outlier on the basis of the expected number of its neighbors in the dataset. To the best of our knowledge this is the first work that considers the unsupervised outlier detection problem on data objects modeled by means of arbitrarily shaped multidimensional distribution functions. We present the UDBOD algorithm, which efficiently detects the outliers in an input uncertain dataset by taking advantage of three optimized phases: parameter estimation, candidate selection, and candidate filtering. An experimental campaign is presented, including a sensitivity analysis, a study of the effectiveness of the technique, a comparison with related algorithms, also in the presence of high dimensional data, and a discussion of the behavior of our technique in real case scenarios.
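The idea of scoring an uncertain object by the expected number of its neighbors can be illustrated with a small Monte Carlo sketch. This is not the UDBOD algorithm and not the paper's exact definition: the Gaussian pdfs, the `radius` and `k` thresholds, and the helper names `gaussian` and `expected_neighbors` are all assumptions made for illustration.

```python
import numpy as np

def gaussian(mean, std):
    """Hypothetical sampler for one uncertain object: an isotropic Gaussian pdf."""
    mean = np.asarray(mean, dtype=float)
    return lambda rng, n: rng.normal(mean, std, size=(n, mean.size))

def expected_neighbors(samplers, i, radius, n_draws=2000, seed=0):
    """Monte Carlo estimate of the expected number of objects within `radius` of object i."""
    rng = np.random.default_rng(seed)
    xi = samplers[i](rng, n_draws)              # draws of object i, shape (n_draws, d)
    total = 0.0
    for j, sampler in enumerate(samplers):
        if j == i:
            continue
        xj = sampler(rng, n_draws)              # independent draws of object j
        dist = np.linalg.norm(xi - xj, axis=1)  # distance between paired draws
        total += np.mean(dist <= radius)        # estimate of P(dist(i, j) <= radius)
    return total

# Toy dataset (assumed): three nearby objects and one far away.
objects = [gaussian([0.0, 0.0], 0.1), gaussian([0.2, 0.0], 0.1),
           gaussian([0.0, 0.2], 0.1), gaussian([10.0, 10.0], 0.1)]
radius, k = 1.0, 1.0                            # hypothetical thresholds
scores = [expected_neighbors(objects, i, radius) for i in range(len(objects))]
outliers = [i for i, s in enumerate(scores) if s < k]
print(scores, outliers)
```

In this toy setting only the far-away object falls below the threshold `k`. A sampling-free or pruned evaluation, such as the candidate selection and filtering phases the abstract mentions, would be needed for large datasets.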

Open Access Repository

Certainty of outlier and boundary points processing in data mining

Author: Guo Yanhui
Minaei-bidgoli Behrouz
Norouzi Sanaz Saki
Rashno Elyas
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 19/12/2018

Data certainty is one of the issues in real-world applications and is caused by unwanted noise in data. Recently, more attention has been paid to overcoming this problem. We propose a new method based on neutrosophic set (NS) theory to detect boundary and outlier points, which are challenging points for clustering methods. First, a certainty value is assigned to data points based on the proposed definition in NS. Then, a certainty set is presented for the proposed cost function in the NS domain by considering a set of main clusters and a noise cluster. After that, the proposed cost function is minimized by the gradient descent method. Data points are clustered based on their membership degrees. Outlier points are assigned to the noise cluster, and boundary points are assigned to main clusters with almost the same membership degrees. To show the effectiveness of the proposed method, two types of datasets are used: 3 datasets of Scatter type and 4 datasets of UCI type. Results demonstrate that the proposed cost function handles boundary and outlier points with more accurate membership degrees and outperforms existing state-of-the-art clustering methods.
Comment: Conference Paper, 6 pages

arXiv.org e-Print Archive

Crossref

Log-based Anomaly Detection of CPS Using a Statistical Method

Author: Choi Eun-Hye
Harada Yoshiyuki
Mizuno Osamu
Yamagata Yoriyuki
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 12/01/2017

Detecting anomalies in a cyber physical system (CPS), which is a complex system consisting of both physical and software parts, is important because a CPS often operates autonomously in an unpredictable environment. However, because of the ever-changing nature of a CPS and the lack of a precise model for it, detecting anomalies is still a challenging task. To address this problem, we propose applying an outlier detection method to a CPS log. Using a log obtained from an actual aquarium management system, we evaluated the effectiveness of our proposed method by analyzing the outliers that it detected. By investigating the outliers with the developer of the system, we confirmed that some outliers indicate actual faults in the system. For example, our method detected failures of mutual exclusion in the control system that were unknown to the developer. Our method also detected transient losses of functionality and unexpected reboots. On the other hand, our method did not detect anomalies that were too numerous and too similar to one another. In addition, our method reported rare but unproblematic concurrent combinations of operations as anomalies. Thus, our approach is effective at finding anomalies, but there is still room for improvement.

arXiv.org e-Print Archive

Crossref

The Infrared Database of Extragalactic Observables from Spitzer I: the redshift catalog

Author: Barry Donald P.
Hernán-Caballero Antonio
Lebouteiller Vianney
Rupke David S. N.
Spoon Henrik W. W.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 23/11/2015

This is the first of a series of papers on the Infrared Database of Extragalactic Observables from Spitzer (IDEOS). In this work we describe the identification of optical counterparts of the infrared sources detected in Spitzer Infrared Spectrograph (IRS) observations, and the acquisition and validation of redshifts. The IDEOS sample includes all the spectra from the Cornell Atlas of Spitzer/IRS Sources (CASSIS) of galaxies beyond the Local Group. Optical counterparts were identified from correlation of the extraction coordinates with the NASA Extragalactic Database (NED). To confirm the optical association and validate NED redshifts, we measure redshifts with unprecedented accuracy on the IRS spectra ($\sigma(dz/(1+z)) = 0.0011$) by using an improved version of the maximum combined pseudo-likelihood method (MCPL). We perform a multi-stage verification of redshifts that considers alternate NED redshifts, the MCPL redshift, and visual inspection of the IRS spectrum. The statistics are as follows: the IDEOS sample contains 3361 galaxies at redshift 0<z<6.42 (mean: 0.48, median: 0.14). We confirm the default NED redshift for 2429 sources and identify 124 with incorrect NED redshifts. We obtain IRS-based redshifts for 568 IDEOS sources without optical spectroscopic redshifts, including 228 with no previous redshift measurements. We provide the entire IDEOS redshift catalog in machine-readable formats. The catalog condenses our compilation and verification effort, and includes our final evaluation of the most likely redshift for each source, its origin, and reliability estimates.
Comment: 11 pages, 6 figures, 1 table. Accepted for publication in MNRAS. Full redshift table in machine-readable format available at http://ideos.astro.cornell.edu/redshifts.htm

arXiv.org e-Print Archive

redMaPPer III: A Detailed Comparison of the Planck 2013 and SDSS DR8 RedMaPPer Cluster Catalogs

Author: Bartlett James G.
Melin Jean B.
Rozo Eduardo
Rykoff Eli S.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 29/01/2014

We compare the Planck Sunyaev-Zeldovich (SZ) cluster sample (PSZ1) to the Sloan Digital Sky Survey (SDSS) redMaPPer catalog, finding that all Planck clusters within the redMaPPer mask and within the redshift range probed by redMaPPer are contained in the redMaPPer cluster catalog. These common clusters define a tight scaling relation in the richness-SZ mass ($\lambda$-$M_{SZ}$) plane, with an intrinsic scatter in richness of $\sigma_{\lambda|M_{SZ}} = 0.266 \pm 0.017$. The corresponding intrinsic scatter in true cluster halo mass at fixed richness is $\approx 21\%$. The regularity of this scaling relation is used to identify failures in both the redMaPPer and Planck cluster catalogs. Of the 245 galaxy clusters in common, we identify three failures in redMaPPer and 36 failures in the PSZ1. Of these, at least 12 are due to clusters whose optical counterpart was correctly identified in the PSZ1, but where the quoted redshift for the optical counterpart in the external data base used in the PSZ1 was incorrect. The failure rates for redMaPPer and the PSZ1 are $1.2\%$ and $14.7\%$ respectively, or $9.8\%$ in the PSZ1 after subtracting the external data base errors. We have further identified 5 PSZ1 sources that suffer from projection effects (multiple rich systems along the line-of-sight of the SZ detection) and 17 new high redshift ($z \gtrsim 0.6$) cluster candidates of varying degrees of confidence. Should all of the high-redshift cluster candidates identified here be confirmed, we will have tripled the number of high redshift Planck clusters in the SDSS region. Our results highlight the power of multi-wavelength observations to identify and characterize systematic errors in galaxy cluster data sets, and clearly establish photometric data both as a robust cluster finding method, and as an important part of defining clean galaxy cluster samples.
Comment: comments welcome
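The failure rates quoted in this abstract follow from the counts it gives. The short check below reproduces them, assuming (as the wording suggests) that each rate is a count divided by the 245 clusters in common and that the adjusted PSZ1 rate subtracts the 12 external data base errors. The variable names are illustrative only.

```python
# Counts taken from the abstract.
common_clusters = 245
redmapper_failures = 3
psz1_failures = 36
external_db_errors = 12   # "at least 12" in the abstract, so the adjusted rate is an upper bound

def rate(count, total):
    """Failure rate as a percentage of the clusters in common."""
    return 100.0 * count / total

redmapper_rate = rate(redmapper_failures, common_clusters)
psz1_rate = rate(psz1_failures, common_clusters)
psz1_adjusted = rate(psz1_failures - external_db_errors, common_clusters)

print(f"redMaPPer: {redmapper_rate:.1f}%")      # 1.2%
print(f"PSZ1: {psz1_rate:.1f}%")                # 14.7%
print(f"PSZ1 adjusted: {psz1_adjusted:.1f}%")   # 9.8%
```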

arXiv.org e-Print Archive