Outlier detection for multivariate categorical data
This is an Accepted Manuscript of an article published by Taylor & Francis in “Quality and Reliability Engineering International” on 6 June 2018, available online: https://onlinelibrary.wiley.com/doi/abs/10.1002/qre.2339. The detection of outlying rows in a contingency table is tackled from a Bayesian perspective, by adapting the framework used by Box and Tiao for normal models to multinomial models with random effects. The solution assumes a two-component mixture of multinomial continuous mixtures, one component for the nonoutlier rows and the other for the outlier rows. The method first estimates the distributional characteristics of the nonoutlier rows and then uses cluster analysis to identify which rows belong to the outlier group and which do not. The method applies to any type of contingency table and, in particular, could be used in the analysis of multivariate categorical control charts. Here, the use of the method is illustrated through a simulated example and by applying it to help identify heterogeneities of style among the acts in the plays of the First Folio edition of Shakespeare drama. Peer Reviewed. Postprint (author's final draft).
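As a rough illustration of what an "outlying row" in a contingency table means, the sketch below scores each row with a Pearson chi-square statistic against the pooled column proportions. This is a much simpler frequentist stand-in, not the Bayesian mixture model described in the abstract; the function name `row_chi_square` and the example counts are hypothetical.

```python
from typing import List, Sequence


def row_chi_square(table: Sequence[Sequence[int]]) -> List[float]:
    """Pearson chi-square statistic of each row against pooled column proportions.

    A simplified stand-in for illustration only; it is NOT the Bayesian
    mixture approach of the abstract above.
    """
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(col_totals)
    stats = []
    for row in table:
        row_total = sum(row)
        stat = 0.0
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            if expected > 0:  # skip empty columns to avoid division by zero
                stat += (observed - expected) ** 2 / expected
        stats.append(stat)
    return stats


# Hypothetical counts: rows are units (e.g. acts), columns are categories.
table = [
    [10, 10, 10],
    [12, 9, 9],
    [9, 11, 10],
    [2, 3, 25],  # row with a visibly different profile
]
stats = row_chi_square(table)
most_outlying = stats.index(max(stats))
print(most_outlying, [round(s, 2) for s in stats])
```

Note that the unusual row also shifts the pooled proportions it is compared against, which is one reason the method in the abstract first estimates the characteristics of the nonoutlier rows.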
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
Sensitivity and robustness in MDS configurations for mixed-type data: a study of the economic crisis impact on socially vulnerable Spanish people
Multidimensional scaling (MDS) techniques were initially proposed to produce pictorial representations of distance, dissimilarity or proximity data. Sensitivity and robustness assessment of multivariate methods is essential if inferences are to be drawn from the analysis. To our knowledge, the literature on MDS for mixed-type data, that is, data with variables measured at a continuous level alongside categorical ones, is quite scarce. The main motivation of this work was to analyze the stability and robustness of MDS configurations as an extension of a previous study on a real data set, coming from a panel-type analysis designed to assess the impact of the economic crisis on Spanish people who were at high risk of social exclusion. The main contributions of the paper on the treatment of MDS configurations for mixed-type data are: (i) to propose a joint metric based on distance matrices computed for continuous, multi-scale categorical and/or binary variables, (ii) to introduce a systematic analysis of the sensitivity of MDS configurations and (iii) to present a systematic search for robustness and identification of outliers through a new procedure based on geometric variability notions. Keywords: Gower distance, MDS configurations, Mixed-type data, Outliers identification, Related metric scaling, Survey data.
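For readers unfamiliar with the Gower distance named in the keywords, here is a minimal sketch of the classic Gower dissimilarity for mixed-type records: numeric variables contribute a range-scaled absolute difference, categorical variables contribute 0 on a match and 1 otherwise, and the contributions are averaged. It is not the joint metric proposed in the paper; the helper names and the survey-like records are hypothetical.

```python
from typing import List, Optional, Sequence, Union

Value = Union[float, str]


def column_ranges(rows: Sequence[Sequence[Value]],
                  numeric: Sequence[bool]) -> List[Optional[float]]:
    """Observed range of each numeric column; None marks a categorical column."""
    ranges: List[Optional[float]] = []
    for k, is_numeric in enumerate(numeric):
        if is_numeric:
            values = [float(row[k]) for row in rows]
            ranges.append(max(values) - min(values))
        else:
            ranges.append(None)
    return ranges


def gower_distance(a: Sequence[Value], b: Sequence[Value],
                   ranges: Sequence[Optional[float]]) -> float:
    """Classic Gower dissimilarity with equal weights and no missing values."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical or binary variable: simple mismatch indicator.
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # Numeric variable: absolute difference scaled by the observed range.
            total += abs(float(x) - float(y)) / r
        # A constant numeric column (r == 0) contributes nothing.
    return total / len(ranges)


# Hypothetical records: (age, monthly income, employment status, owns home)
rows = [
    (30.0, 1200.0, "employed", "yes"),
    (50.0, 800.0, "unemployed", "yes"),
    (40.0, 1000.0, "employed", "no"),
]
ranges = column_ranges(rows, numeric=[True, True, False, False])
d01 = gower_distance(rows[0], rows[1], ranges)
d02 = gower_distance(rows[0], rows[2], ranges)
print(d01, d02)
```

A full matrix of such pairwise dissimilarities is the kind of input an MDS routine takes to produce a low-dimensional configuration.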
Automatic Bayesian Density Analysis
Making sense of a dataset in an automatic and unsupervised fashion is a
challenging problem in statistics and AI. Classical approaches for exploratory
data analysis are usually not flexible enough to deal with the uncertainty
inherent to real-world data: they are often restricted to fixed latent
interaction models and homogeneous likelihoods; they are sensitive to missing,
corrupt and anomalous data; moreover, their expressiveness generally comes at
the price of intractable inference. As a result, supervision from statisticians
is usually needed to find the right model for the data. However, since domain
experts are not necessarily also experts in statistics, we propose Automatic
Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible
at large. Specifically, ABDA allows for automatic and efficient missing value
estimation, statistical data type and likelihood discovery, anomaly detection
and dependency structure mining, on top of providing accurate density
estimation. Extensive empirical evidence shows that ABDA is a suitable tool for
automatic exploratory analysis of mixed continuous and discrete tabular data.
Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial
Intelligence (AAAI-19).
- …