Uncertain distance-based outlier detection with arbitrarily shaped data objects
Enabling information systems to face anomalies in the presence of uncertainty is a compelling and challenging task. In this work the problem of unsupervised outlier detection in large collections of data objects modeled by means of arbitrary multidimensional probability density functions is considered. We present a novel definition of uncertain distance-based outlier under the attribute-level uncertainty model, according to which an uncertain object is an object that always exists but whose actual value is modeled by a multivariate pdf. According to this definition, an uncertain object is declared to be an outlier on the basis of the expected number of its neighbors in the dataset. To the best of our knowledge this is the first work that considers the unsupervised outlier detection problem on data objects modeled by means of arbitrarily shaped multidimensional distribution functions. We present the UDBOD algorithm, which efficiently detects the outliers in an input uncertain dataset by taking advantage of three optimized phases: parameter estimation, candidate selection, and candidate filtering. An experimental campaign is presented, including a sensitivity analysis, a study of the effectiveness of the technique, a comparison with related algorithms, also in the presence of high-dimensional data, and a discussion of the behavior of our technique in real-case scenarios.
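The definition above (an object is an outlier when its expected number of neighbors is low) can be illustrated with a minimal Monte Carlo sketch. This is not the UDBOD algorithm and does not reproduce its three phases; it assumes each uncertain object is given as a list of equally weighted draws from its pdf, that objects are independent, and that the radius `radius` and neighbor threshold `k` are supplied by the user. The function names are hypothetical.

```python
import math


def expected_neighbors(samples, i, radius):
    """Monte Carlo estimate of the expected number of neighbors of object i.

    samples[j] is a list of equally weighted draws (coordinate tuples) from
    the pdf of object j. Draws are paired by index, which assumes the
    objects are independent of one another (an assumption of this sketch).
    """
    total = 0.0
    for j, other in enumerate(samples):
        if j == i:
            continue
        n = min(len(samples[i]), len(other))
        hits = sum(
            1
            for a, b in zip(samples[i][:n], other[:n])
            if math.dist(a, b) <= radius
        )
        # hits / n estimates P(dist(X_i, X_j) <= radius)
        total += hits / n
    return total


def uncertain_outliers(samples, radius, k):
    """Return indices of objects whose expected neighbor count is below k."""
    return [
        i
        for i in range(len(samples))
        if expected_neighbors(samples, i, radius) < k
    ]
```

For example, with two objects concentrated near the origin and one far away, only the distant object falls below a threshold of one expected neighbor. A brute-force estimate like this costs a pass over all pairs of objects, which is the kind of cost that candidate selection and filtering phases are meant to avoid.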
Certainty of outlier and boundary points processing in data mining
Data certainty is an issue in real-world applications because of unwanted
noise in data, and it has recently received increasing attention. We propose a
new method based on neutrosophic set (NS) theory to detect boundary and outlier
points, which are challenging for clustering methods. First, a certainty value
is assigned to each data point based on the proposed definition in NS. Then, a
certainty set is presented for the proposed cost function in the NS domain by
considering a set of main clusters and a noise cluster. Next, the proposed cost
function is minimized by gradient descent. Data points are clustered based on
their membership degrees: outlier points are assigned to the noise cluster, and
boundary points are assigned to main clusters with almost the same membership
degrees. To show the effectiveness of the proposed method, two types of
datasets are used: 3 scatter-type datasets and 4 UCI datasets. Results
demonstrate that the proposed cost function handles boundary and outlier points
with more accurate membership degrees and outperforms existing state-of-the-art
clustering methods. Comment: Conference Paper, 6 pages
Log-based Anomaly Detection of CPS Using a Statistical Method
Detecting anomalies of a cyber physical system (CPS), which is a complex
system consisting of both physical and software parts, is important because a
CPS often operates autonomously in an unpredictable environment. However,
because of the ever-changing nature and lack of a precise model for a CPS,
detecting anomalies is still a challenging task. To address this problem, we
propose applying an outlier detection method to a CPS log. By using a log
obtained from an actual aquarium management system, we evaluated the
effectiveness of our proposed method by analyzing outliers that it detected. By
investigating the outliers with the developer of the system, we confirmed that
some outliers indicate actual faults in the system. For example, our method
detected failures of mutual exclusion in the control system that were unknown
to the developer. Our method also detected transient losses of functionalities
and unexpected reboots. On the other hand, our method did not detect anomalies
that were numerous and similar to one another. In addition, our method reported
rare but unproblematic concurrent combinations of operations as anomalies. Thus,
our approach is effective at finding anomalies, but there is still room for
improvement.
The Infrared Database of Extragalactic Observables from Spitzer I: the redshift catalog
This is the first of a series of papers on the Infrared Database of
Extragalactic Observables from Spitzer (IDEOS). In this work we describe the
identification of optical counterparts of the infrared sources detected in
Spitzer Infrared Spectrograph (IRS) observations, and the acquisition and
validation of redshifts. The IDEOS sample includes all the spectra from the
Cornell Atlas of Spitzer/IRS Sources (CASSIS) of galaxies beyond the Local
Group. Optical counterparts were identified from correlation of the extraction
coordinates with the NASA Extragalactic Database (NED). To confirm the optical
association and validate NED redshifts, we measure redshifts with unprecedented
accuracy on the IRS spectra (σ(dz/(1+z)) = 0.0011) by using an improved
version of the maximum combined pseudo-likelihood method (MCPL). We perform a
multi-stage verification of redshifts that considers alternate NED redshifts,
the MCPL redshift, and visual inspection of the IRS spectrum. The statistics are
as follows: the IDEOS sample contains 3361 galaxies at redshift 0<z<6.42 (mean:
0.48, median: 0.14). We confirm the default NED redshift for 2429 sources and
identify 124 with incorrect NED redshifts. We obtain IRS-based redshifts for
568 IDEOS sources without optical spectroscopic redshifts, including 228 with
no previous redshift measurements. We provide the entire IDEOS redshift catalog
in machine-readable formats. The catalog condenses our compilation and
verification effort, and includes our final evaluation on the most likely
redshift for each source, its origin, and reliability estimates. Comment: 11 pages, 6 figures, 1 table. Accepted for publication in MNRAS. Full
redshift table in machine-readable format available at
http://ideos.astro.cornell.edu/redshifts.htm
redMaPPer III: A Detailed Comparison of the Planck 2013 and SDSS DR8 RedMaPPer Cluster Catalogs
We compare the Planck Sunyaev-Zeldovich (SZ) cluster sample (PSZ1) to the
Sloan Digital Sky Survey (SDSS) redMaPPer catalog, finding that all Planck
clusters within the redMaPPer mask and within the redshift range probed by
redMaPPer are contained in the redMaPPer cluster catalog. These common clusters
define a tight scaling relation in the richness-SZ mass (--)
plane, with an intrinsic scatter in richness of . The corresponding intrinsic scatter in true cluster halo mass
at fixed richness is . The regularity of this scaling relation is
used to identify failures in both the redMaPPer and Planck cluster catalogs. Of
the 245 galaxy clusters in common, we identify three failures in redMaPPer and
36 failures in the PSZ1. Of these, at least 12 are due to clusters whose
optical counterpart was correctly identified in the PSZ1, but where the quoted
redshift for the optical counterpart in the external data base used in the PSZ1
was incorrect. The failure rates for redMaPPer and the PSZ1 are and
respectively, or 9.8% in the PSZ1 after subtracting the external data
base errors. We have further identified 5 PSZ1 sources that suffer from
projection effects (multiple rich systems along the line-of-sight of the SZ
detection) and 17 new high redshift () cluster candidates of
varying degrees of confidence. Should all of the high-redshift cluster
candidates identified here be confirmed, we will have tripled the number of
high redshift Planck clusters in the SDSS region. Our results highlight the
power of multi-wavelength observations to identify and characterize systematic
errors in galaxy cluster data sets, and clearly establish photometric data both
as a robust cluster finding method, and as an important part of defining clean
galaxy cluster samples. Comment: comments welcome
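The failure counts quoted in this abstract can be checked with simple arithmetic. The sketch below uses only the counts stated in the text (245 common clusters, 3 redMaPPer failures, 36 PSZ1 failures, at least 12 of them due to external database errors); the percentages it computes are derived here from those counts and are not quoted from the paper, except that the adjusted PSZ1 rate agrees with the 9.8% stated above. The function and constant names are illustrative.

```python
def failure_rate_percent(failures, total):
    """Return `failures` as a percentage of `total`."""
    return 100.0 * failures / total


COMMON_CLUSTERS = 245
REDMAPPER_FAILURES = 3
PSZ1_FAILURES = 36
# The abstract says "at least 12", so this is a lower bound.
EXTERNAL_DB_ERRORS = 12

redmapper_rate = failure_rate_percent(REDMAPPER_FAILURES, COMMON_CLUSTERS)
psz1_rate = failure_rate_percent(PSZ1_FAILURES, COMMON_CLUSTERS)
psz1_adjusted = failure_rate_percent(
    PSZ1_FAILURES - EXTERNAL_DB_ERRORS, COMMON_CLUSTERS
)

print(f"redMaPPer: {redmapper_rate:.1f}%")
print(f"PSZ1: {psz1_rate:.1f}%")
print(f"PSZ1 after removing external database errors: {psz1_adjusted:.1f}%")
```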