Identification of Outlying Observations with Quantile Regression for Censored Data
Outlying observations, which significantly deviate from other measurements,
may distort the conclusions of data analysis. Therefore, identifying outliers
is one of the important problems that should be solved to obtain reliable
results. While there are many statistical outlier detection algorithms and
software programs for uncensored data, few are available for censored data. In
this article, we propose three outlier detection algorithms based on censored
quantile regression, two of which are modified versions of existing algorithms
for uncensored or censored data, while the third is a newly developed algorithm
to overcome the demerits of previous approaches. The performance of the three
algorithms was investigated in simulation studies. In addition, real data from
the SEER database, which contains a variety of data sets related to various
cancers, are used to illustrate the usefulness of our methodology. The
algorithms are implemented in the R package OutlierDC, which can be
conveniently employed in the R environment and is freely available from
CRAN.
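The fence-style idea behind quantile-regression outlier detection can be sketched in a few lines. The sketch below is not the OutlierDC algorithm (which handles censoring); it fits two conditional quantile lines to uncensored data by subgradient descent on the pinball loss and flags points outside IQR-style fences. All function names and tuning constants here are illustrative assumptions.

```python
import numpy as np

def fit_quantile_line(x, y, tau, lr=0.5, n_iter=4000):
    """Fit y ~ a + b*x at quantile tau by subgradient descent on the pinball loss."""
    a, b = np.median(y), 0.0
    for t in range(n_iter):
        step = lr / np.sqrt(t + 1.0)          # decaying step for convergence
        r = y - (a + b * x)
        # Pinball-loss subgradient: tau where the residual is positive, tau - 1 otherwise.
        g = np.where(r > 0, tau, tau - 1.0)
        a += step * g.mean()
        b += step * (g * x).mean()
    return a, b

def quantile_fence_outliers(x, y, k=1.5):
    """Flag points outside fences built from the 0.25 and 0.75 conditional quantiles."""
    a1, b1 = fit_quantile_line(x, y, 0.25)
    a3, b3 = fit_quantile_line(x, y, 0.75)
    q1, q3 = a1 + b1 * x, a3 + b3 * x
    iqr = q3 - q1
    return (y < q1 - k * iqr) | (y > q3 + k * iqr)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)
y[:3] += 15.0                                 # inject three gross outliers
flags = quantile_fence_outliers(x, y)
```

A fence width of `k=1.5` mirrors the classic boxplot rule; censored responses would require replacing the pinball loss with a censored quantile regression objective, as the paper does.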
Rainbow plots, Bagplots and Boxplots for Functional Data
We propose new tools for visualizing large numbers of functional data in the form of smooth curves or surfaces. The proposed tools include functional versions of the bagplot and boxplot, and make use of the first two robust principal component scores, Tukey's data depth and highest density regions. By-products of our graphical displays are outlier detection methods for functional data. We compare these new outlier detection methods with existing methods for detecting outliers in functional data and show that our methods are better able to identify the outliers. Keywords: highest density regions, robust principal component analysis, kernel density estimation, outlier detection, Tukey's halfspace depth.
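A minimal depth-based version of this idea can be sketched as follows. This is not the bagplot/boxplot construction of the paper (which uses robust principal component scores); it computes the modified band depth of each curve and flags the lowest-depth curves, under the assumption that outliers sit outside most pairwise envelopes. The function names and the 10% flagging proportion are illustrative choices.

```python
import numpy as np
from itertools import combinations

def modified_band_depth(curves):
    """Modified band depth with 2-curve bands: for each curve, the average
    fraction of time points at which it lies inside a pair's envelope."""
    n = curves.shape[0]
    depth = np.zeros(n)
    for a, b in combinations(range(n), 2):
        lo = np.minimum(curves[a], curves[b])
        hi = np.maximum(curves[a], curves[b])
        inside = (curves >= lo) & (curves <= hi)   # n x T boolean matrix
        depth += inside.mean(axis=1)
    return depth / (n * (n - 1) / 2)

def flag_functional_outliers(curves, prop=0.1):
    """Flag the lowest-depth curves (a simple depth cutoff, not the bagplot rule)."""
    depth = modified_band_depth(curves)
    return depth <= np.quantile(depth, prop)

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)
curves = np.sin(2 * np.pi * t) + rng.normal(0, 0.2, (40, 50))
curves[0] += 3.0                                   # a shifted (magnitude) outlier
flags = flag_functional_outliers(curves)
```

Band depth catches magnitude outliers well; shape outliers generally need the richer displays (bagplots, highest density regions) the paper proposes.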
A new non-parametric detector of univariate outliers for distributions with unbounded support
The purpose of this paper is to construct a new non-parametric detector of
univariate outliers and to study its asymptotic properties. This detector is
based on a Hill-type statistic. It exhibits the same asymptotic behavior for a
large class of probability distributions with positive unbounded support (for
instance, the absolute value of Gaussian, Gamma, Weibull, Student, or
regularly varying distributions). We illustrate our results with numerical
simulations, which show the accuracy of this detector relative to other common
univariate outlier detectors (the Tukey, MAD, and Local Outlier Factor
detectors). An application to a real-life database of used-car prices is also
presented.
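A rough sketch of a Hill-statistic-based detector is given below. It is an assumption-laden stand-in for the paper's detector: it compares the top log-spacing of the ordered sample against a multiple of the Hill estimator computed from the k largest order statistics, flagging the maximum when the gap is disproportionately large. The choices k=20 and c=4.0 are arbitrary illustrative constants.

```python
import numpy as np

def hill_estimator(x, k):
    """Hill estimator of the inverse tail index from the k largest order statistics."""
    xs = np.sort(x)
    return np.mean(np.log(xs[-k:] / xs[-k - 1]))

def flag_max_as_outlier(x, k=20, c=4.0):
    """Flag the sample maximum when the top log-spacing is large relative to the
    Hill statistic (a hedged sketch, not the paper's exact detector)."""
    xs = np.sort(x)
    top_gap = np.log(xs[-1] / xs[-2])
    return top_gap > c * hill_estimator(x, k)

rng = np.random.default_rng(2)
x = np.abs(rng.normal(0, 1, 1000))       # positive unbounded support
x_out = np.append(x, 100.0)              # inject one extreme value
```

For regularly varying tails the normalized log-spacings are approximately exponential with mean equal to the Hill statistic, which is what makes a threshold of this form distribution-free over a large class of tails.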
Providing scientific visualisation for spatial data analysis: criteria and an assessment of SAGE
A consistent theme in recent work on developing exploratory spatial data analysis (ESDA) has been the importance attached to visualisation techniques, particularly following the pioneering development of packages such as REGARD by Haslett et al (1990). The focus on visual techniques is often justified in two ways: (a) the power of modern graphical interfaces means that graphics is no longer simply a way of presenting results in the form of maps or graphs, but a tool for extracting information from data; (b) graphical, exploratory methods are felt to be more intuitive for non-specialists than numerical spatial statistics, enabling wider participation in the process of gaining insights from data. Despite the importance attached to visualisation techniques, very little work has been done to assess their effectiveness, either in the wider scientific visualisation community or among those working with spatial data. This paper describes a theoretical framework for developing visualisation tools for ESDA that incorporates a data model of what the analyst is looking for, based on the concepts of "rough" and "smooth" elements of a data set, and a theoretical scheme for assessing visual tools. The paper includes examples of appropriate tools and a commentary on the effectiveness of some existing packages.
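The rough/smooth data model mentioned above can be made concrete with a tiny decomposition sketch: estimate the "smooth" component with a running median and treat the residual as the "rough" component, where anomalies stand out. This is only an illustrative reading of the concept, not a method from the paper; the window size is an arbitrary assumption.

```python
import numpy as np

def rough_and_smooth(y, window=5):
    """Split a series into a 'smooth' component (running median) and the
    'rough' residual, echoing the rough/smooth data model discussed above."""
    n = len(y)
    half = window // 2
    smooth = np.array([np.median(y[max(0, i - half):min(n, i + half + 1)])
                       for i in range(n)])
    return smooth, y - smooth

y = np.array([1.0, 1.1, 0.9, 5.0, 1.0, 1.2, 0.8])
smooth, rough = rough_and_smooth(y)
# The spike at index 3 dominates the rough component.
```

Visualisation tools for ESDA can then map the two components separately, letting the analyst judge whether structure lives in the smooth part or the rough part.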
Sparse Functional Boxplots for Multivariate Curves
This paper introduces the sparse functional boxplot and the intensity sparse
functional boxplot as practical exploratory tools. Besides being available for
complete functional data, they can be used in sparse univariate and
multivariate functional data. The sparse functional boxplot, based on the
functional boxplot, displays sparseness proportions within the 50% central
region. The intensity sparse functional boxplot indicates the relative
intensity of fitted sparse point patterns in the central region. The two-stage
functional boxplot, which derives from the functional boxplot to detect
outliers, is furthermore extended to its sparse form. We also contribute to
sparse data fitting improvement and sparse multivariate functional data depth.
In a simulation study, we evaluate the goodness of data fitting, several depth
proposals for sparse multivariate functional data, and compare the results of
outlier detection between the sparse functional boxplot and its two-stage
version. The practical applications of the sparse functional boxplot and
intensity sparse functional boxplot are illustrated with two public health
datasets. Supplementary materials and codes are available for readers to apply
our visualization tools and replicate the analysis. Comment: 33 pages, 7 figures.
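The core quantity displayed by a sparse functional boxplot, the proportion of unobserved values inside the central region, can be sketched quite simply. The code below is an assumption-heavy stand-in: it scores centrality by distance to the cross-sectional median rather than by a proper sparse multivariate functional depth, takes the 50% most central curves, and reports per-time-point missingness among them.

```python
import numpy as np

def sparseness_in_central_region(curves, central_prop=0.5):
    """For sparsely observed curves (NaN = unobserved), estimate the proportion
    of missing values per time point among the central curves. Centrality is
    scored here by mean absolute distance to the pointwise median, a stand-in
    for the sparse functional depths discussed in the paper."""
    med = np.nanmedian(curves, axis=0)
    score = np.nanmean(np.abs(curves - med), axis=1)   # low score = central
    n_central = max(1, int(central_prop * len(curves)))
    central = curves[np.argsort(score)[:n_central]]
    return np.mean(np.isnan(central), axis=0)

rng = np.random.default_rng(3)
curves = rng.normal(0, 1, (20, 30))
mask = rng.random((20, 30)) < 0.2        # drop roughly 20% of observations
curves[mask] = np.nan
prop = sparseness_in_central_region(curves)
```

Overlaying `prop` on the central region is the kind of display the sparse functional boxplot formalizes; the intensity version additionally models the fitted sparse point patterns.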
The DetectDeviatingCells algorithm was a useful addition to the toolkit for cellwise error detection in observational data
OBJECTIVE: We evaluated the error detection performance of the DetectDeviatingCells (DDC) algorithm, which flags data anomalies at observation (casewise) and variable (cellwise) level in continuous variables. We compared its performance to other approaches in a simulated dataset. STUDY DESIGN AND SETTING: We simulated height and weight data for hypothetical individuals aged 2-20 years. We changed a proportion of height values according to pre-determined error patterns. We applied the DDC algorithm and other error-detection approaches (descriptive statistics, plots, fixed-threshold rules, classic and robust Mahalanobis distance) and we compared error detection performance with sensitivity, specificity, likelihood ratios, predictive values and ROC curves. RESULTS: At our chosen thresholds, error detection specificity was excellent across all scenarios for all methods and sensitivity was higher for multivariable and robust methods. The DDC algorithm performance was similar to other robust multivariable methods. Analysis of ROC curves suggested that all methods had comparable performance for gross errors (e.g. wrong measurement unit), but the DDC algorithm outperformed the others for more complex error patterns (e.g. transcription errors that are still plausible, although extreme). CONCLUSIONS: The DDC algorithm has the potential to improve error detection processes for observational data
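The univariate first step of cellwise screening can be sketched as below. This is not the full DDC algorithm, which also exploits between-column correlations to predict cell values; it only flags cells whose robust z-score (per-column median and MAD) exceeds a cutoff. The simulated columns and the 3.5 cutoff are illustrative assumptions.

```python
import numpy as np

def cellwise_flags(X, cutoff=3.5):
    """Flag individual cells whose robust z-score exceeds a cutoff.
    This is only the univariate screening step, not the full DDC algorithm,
    which additionally uses correlations between columns."""
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)  # consistency factor for Gaussians
    z = (X - med) / mad
    return np.abs(z) > cutoff

rng = np.random.default_rng(4)
X = np.column_stack([rng.normal(120, 10, 200),   # e.g. simulated heights (cm)
                     rng.normal(25, 5, 200)])    # e.g. simulated weights (kg)
X[0, 0] = 1200.0                                 # a measurement-unit-style gross error
flags = cellwise_flags(X)
```

Gross errors like the unit mistake above are caught by this step alone; the plausible-but-extreme transcription errors in the study are exactly the cases where DDC's correlation-based predictions add value.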
Robust classification via MOM minimization
We present an extension of Vapnik's classical empirical risk minimizer (ERM)
in which the empirical risk is replaced by a median-of-means (MOM) estimator;
the new estimators are called MOM minimizers. While ERM is sensitive to
corruption of the dataset for many classical loss functions used in
classification, we show that MOM minimizers behave well in theory, in the
sense that they achieve Vapnik's (slow) rates of convergence under weak
assumptions: the data are only required to have a finite second moment, and
some outliers may also have corrupted the dataset.
We propose algorithms inspired by MOM minimizers. These can be analyzed using
arguments quite similar to those used for stochastic block gradient descent.
As a proof of concept, we show how to modify a proof of consistency for a
descent algorithm to prove consistency of its MOM version. Because MOM
algorithms perform a smart subsampling, our procedure can also substantially
reduce computation time and memory usage when applied to nonlinear
algorithms.
These empirical performances are illustrated on both simulated and real
datasets.
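The median-of-means estimator at the heart of this construction is easy to sketch for a scalar mean: split the sample into blocks, average each block, and take the median of the block means, so a few corrupted points can spoil only a minority of blocks. The block count below is an illustrative choice (it should exceed twice the number of outliers).

```python
import numpy as np

def median_of_means(x, n_blocks=25):
    """Median-of-means: randomly partition the sample into blocks, average
    each block, and return the median of the block means."""
    rng = np.random.default_rng(0)
    x = rng.permutation(x)               # randomize block membership
    blocks = np.array_split(x, n_blocks)
    return np.median([b.mean() for b in blocks])

rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, 1000)
x[:5] = 1e6                              # corrupt a few points
mom = median_of_means(x)                 # stays near the true mean of 1.0
```

MOM minimization replaces the empirical risk by this kind of estimate of the expected loss; the subsampling interpretation is that each gradient-type step only touches the median block.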