Robust techniques and applications in fuzzy clustering
This dissertation addresses issues central to fuzzy classification. The sensitivity to noise and outliers of clustering techniques based on least squares minimization, such as Fuzzy c-Means (FCM) and its variants, is addressed. In this work, two novel and robust clustering schemes are presented and analyzed in detail. They approach the problem of robustness from different perspectives. The first scheme scales down the FCM memberships of data points based on the distance of the points from the cluster centers. Scaling done on outliers reduces their membership in true clusters. This scheme, known as Mega-clustering, defines a conceptual mega-cluster which is a collective cluster of all data points but views outliers and good points differently (as opposed to the concept of Dave's Noise cluster). The scheme is presented and validated with experiments, and its similarities with Noise Clustering (NC) are also presented. The other scheme is based on the feasible solution algorithm that implements the Least Trimmed Squares (LTS) estimator. The LTS estimator is known to be resistant to noise and has a high breakdown point. The feasible solution approach also guarantees convergence of the solution set to a global optimum. Experiments show the practicability of the proposed schemes in terms of computational requirements and the attractiveness of their simplistic frameworks.
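To illustrate why plain FCM is sensitive to outliers, the following minimal sketch computes the standard FCM membership update, u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)). It is not the Mega-clustering or LTS scheme of the dissertation; the function name `fcm_memberships`, the Euclidean distance and the default fuzzifier `m=2.0` are assumptions made for illustration.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update (illustrative sketch, not the dissertation's scheme).

    Returns one row per point; each row holds the point's membership in every cluster.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers: share full membership among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        row = [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
        memberships.append(row)
    return memberships

# A far-away point still receives memberships that sum to 1 across the clusters.
u = fcm_memberships([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)], [(0.0, 0.0), (2.0, 0.0)])
```

Because each row is forced to sum to 1, the distant point at (10, 0) is shared between the two clusters instead of being discounted, which is the behaviour that robust variants aim to correct.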
The issue of validation of clustering results has often received less attention than clustering itself. Fuzzy and non-fuzzy cluster validation schemes are reviewed and a novel methodology for cluster validity using a test for random position hypothesis is developed. The random position hypothesis is tested against an alternative clustered hypothesis on every cluster produced by the partitioning algorithm. The Hopkins statistic is used as a basis to accept or reject the random position hypothesis, which is also the null hypothesis in this case. The Hopkins statistic is known to be a fair estimator of randomness in a data set. The concept is borrowed from the clustering tendency domain and its applicability to validating clusters is shown here.
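As a rough illustration of the statistic mentioned above, the sketch below computes one common form of the Hopkins statistic on a point set. The exact variant used in the dissertation, and how it is applied to each cluster, is not stated in the abstract; the function name `hopkins_statistic`, the bounding-box sampling window and the use of distances raised to the data dimension are assumptions.

```python
import math
import random

def hopkins_statistic(data, n_probes, seed=0):
    """One common form of the Hopkins statistic (illustrative sketch).

    Values near 0.5 suggest random positions; values near 1 suggest clustering.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    lows = [min(p[k] for p in data) for k in range(dim)]
    highs = [max(p[k] for p in data) for k in range(dim)]

    # u: distance from uniformly placed probe locations to the nearest data point.
    u_sum = 0.0
    for _ in range(n_probes):
        probe = [rng.uniform(lows[k], highs[k]) for k in range(dim)]
        u_sum += min(math.dist(probe, p) for p in data) ** dim

    # w: distance from randomly chosen data points to their nearest other data point.
    w_sum = 0.0
    for i in rng.sample(range(len(data)), n_probes):
        w_sum += min(math.dist(data[i], data[j])
                     for j in range(len(data)) if j != i) ** dim

    return u_sum / (u_sum + w_sum)
```

To use it as a test of the random position hypothesis, the computed value would be compared against a threshold or a reference distribution; that choice is left open here.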
A unique feature selection procedure for use with large molecular conformational datasets with high dimensionality is also developed. The intelligent feature extraction scheme not only helps in reducing the dimensionality of the feature space but also helps in eliminating contentious issues such as the ones associated with labeling of symmetric atoms in the molecule. The feature vector is converted to a proximity matrix and is used as an input to the relational fuzzy clustering (FRC) algorithm, with very promising results. Results are also validated using several cluster validity measures from the literature. Another application of fuzzy clustering considered here is image segmentation. Image analysis on extremely noisy images is carried out as a precursor to the development of an automated real-time condition state monitoring system for underground pipelines. A two-stage FCM with intelligent feature selection is implemented as the segmentation procedure and results on a test image are presented. A conceptual framework for automated condition state assessment is also developed.
Robust EM algorithm for model-based curve clustering
Model-based clustering approaches concern the paradigm of exploratory data
analysis relying on the finite mixture model to automatically find a latent
structure governing observed data. They are one of the most popular and
successful approaches in cluster analysis. The mixture density estimation is
generally performed by maximizing the observed-data log-likelihood by using the
expectation-maximization (EM) algorithm. However, it is well-known that the EM
algorithm initialization is crucial. In addition, the standard EM algorithm
requires the number of clusters to be known a priori. Some solutions have been
provided in [31, 12] for model-based clustering with Gaussian mixture models
for multivariate data. In this paper we focus on model-based curve clustering
approaches, when the data are curves rather than vectorial data, based on
regression mixtures. We propose a new robust EM algorithm for clustering
curves. We extend the model-based clustering approach presented in [31] for
Gaussian mixture models, to the case of curve clustering by regression
mixtures, including polynomial regression mixtures as well as spline or
B-spline regression mixtures. Our approach handles both the problem of
initialization and that of choosing the optimal number of clusters as the EM
learning proceeds, rather than in a two-fold scheme. This is achieved by
optimizing a penalized log-likelihood criterion. A simulation study confirms
the potential benefit of the proposed algorithm in terms of robustness
regarding initialization and finding the actual number of clusters.
Comment: In Proceedings of the 2013 International Joint Conference on Neural
Networks (IJCNN), 2013, Dallas, TX, US
Robustness and Outliers
Unexpected deviations from assumed models as well as the presence of certain amounts of outlying data are common in most practical statistical applications. This fact could lead to undesirable solutions when applying non-robust statistical techniques. This is often the case in cluster analysis, too. The search for homogeneous groups with large heterogeneity between them can be spoiled due to the lack of robustness of standard clustering methods. For instance, the presence of (even few) outlying observations may result in heterogeneous clusters artificially joined together or in the detection of spurious clusters merely made up of outlying observations. In this chapter we will analyze the effects of different kinds of outlying data in cluster analysis and explore several alternative methodologies designed to avoid or minimize their undesirable effects.
Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (research project support program, Ref. VA212U13)
A survey of outlier detection methodologies
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but now principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
Android Malware Clustering through Malicious Payload Mining
Clustering has been well studied for desktop malware analysis as an effective
triage method. Conventional similarity-based clustering techniques, however,
cannot be immediately applied to Android malware analysis due to the excessive
use of third-party libraries in Android application development and the
widespread use of repackaging in malware development. We design and implement
an Android malware clustering system through iterative mining of malicious
payload and checking whether malware samples share the same version of
malicious payload. Our system utilizes a hierarchical clustering technique and
an efficient bit-vector format to represent Android apps. Experimental results
demonstrate that our clustering approach achieves precision of 0.90 and recall
of 0.75 on the Android Genome malware dataset, and average precision of 0.98 and
recall of 0.96 with respect to manually verified ground truth.
Comment: Proceedings of the 20th International Symposium on Research in
Attacks, Intrusions and Defenses (RAID 2017)
Robust Fuzzy Clustering via Trimming and Constraints
A methodology for robust fuzzy clustering is proposed. This
methodology can be widely applied in very different statistical problems given
that it is based on probability likelihoods. Robustness is achieved by trimming
a fixed proportion of “most outlying” observations which are indeed
self-determined by the data set at hand. Constraints on the clusters’ scatters
are also needed to get mathematically well-defined problems and to avoid the
detection of non-interesting spurious clusters. The main lines for computationally
feasible algorithms are provided and some simple guidelines about
how to choose tuning parameters are briefly outlined. The proposed methodology
is illustrated through two applications. The first one is aimed at heterogeneously
clustering under multivariate normal assumptions and the second
one might be useful in fuzzy clusterwise linear regression problems.
Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (research project support program, Ref. VA212U13)
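The idea of trimming a fixed proportion of the most outlying observations can be sketched in a simplified, crisp form. The example below is not the likelihood-based fuzzy methodology of the abstract: it uses Euclidean distance to fixed centers, hard assignments, and a hypothetical function name `trim_and_assign` with a trimming level `alpha`, all of which are assumptions for illustration.

```python
import math

def trim_and_assign(points, centers, alpha):
    """Assign points to their nearest center and trim the floor(alpha * n) farthest ones.

    Returns a list of cluster indices, with None marking trimmed observations.
    """
    n = len(points)
    nearest = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        j = min(range(len(centers)), key=dists.__getitem__)
        nearest.append((dists[j], j))

    # The data decide which observations are trimmed: those farthest from every center.
    n_trim = math.floor(alpha * n)
    order = sorted(range(n), key=lambda i: nearest[i][0], reverse=True)
    trimmed = set(order[:n_trim])
    return [None if i in trimmed else nearest[i][1] for i in range(n)]

labels = trim_and_assign(
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (100.0, 100.0)],
    [(0.0, 0.0), (5.0, 5.0)],
    alpha=0.2,
)
```

In an iterative algorithm this step would alternate with re-estimating the cluster parameters from the non-trimmed observations only, so that outliers stop influencing the fit.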
Robust approach to object recognition through fuzzy clustering and Hough transform based methods
Object detection from two-dimensional intensity images as well as three-dimensional range images is considered. The emphasis is on the robust detection of shapes such as cylinders, spheres, cones, and planar surfaces, typically found in mechanical and manufacturing engineering applications. Based on the analyses of different Hough Transform (HT) methods, a novel method, called the Fast Randomized Hough Transform (FRHT), is proposed. The key idea of FRHT is to divide the original image into multiple regions and apply a random sampling method to map data points in the image space into the parameter space or feature space, then obtain the parameters of true clusters. This results in the following characteristics, which are highly desirable in any method: high computation speed, low memory requirement, high result resolution and infinite parameter space. This project also considers the use of fuzzy clustering techniques, such as the Fuzzy C Quadric Shells (FCQS) clustering algorithm, combined with the concept of a noise prototype to form the Noise FCQS (NFCQS) clustering algorithm, which is robust against noise. Then a novel integrated clustering algorithm combining the advantages of the FRHT and NFCQS methods is proposed. It is shown to be a robust clustering algorithm with distinct advantages: the number of clusters need not be known in advance, the results are initialization independent, the detection accuracy is greatly improved, and the computation speed is very fast. Recent concepts from robust statistics, such as least trimmed squares (LTS) estimation, the minimum volume ellipsoid (MVE) estimator and the generalized MVE, are also utilized to develop a new robust algorithm called the generalized LTS for Quadric Surfaces (GLTS-QS) algorithm. The experimental results indicate that the clustering method combining the FRHT and the GLTS-QS can improve clustering performance.
Moreover, a new cluster validity method for circular clusters is proposed by considering the distribution of the points on the circular edge. Different methods for the computation of distance of a point from a cluster boundary, a common issue in all the range image clustering algorithms, are also discussed. The performance of all these algorithms is tested using various real and synthetic range and intensity images. The application of the robust clustering methods to the experimental granular flow research is also included.
Anthropometry: An R Package for Analysis of Anthropometric Data
The development of powerful new 3D scanning techniques has enabled the generation of large up-to-date anthropometric databases which provide highly valued data to improve the ergonomic design of products adapted to the user population. As a consequence, Ergonomics and Anthropometry are two increasingly quantitative fields, so advanced statistical methodologies and modern software tools are required to get the maximum benefit from anthropometric data. This paper presents a new R package, called Anthropometry, which is available on the Comprehensive R Archive Network. It brings together some statistical methodologies concerning clustering, statistical shape analysis, statistical archetypal analysis and the statistical concept of data depth, which have been especially developed to deal with anthropometric data. They are proposed with the aim of providing effective solutions to some common anthropometric problems, such as clothing design or workstation design (focusing on the particular case of aircraft cockpits). The utility of the package is shown by analyzing the anthropometric data obtained from a survey of the Spanish female population performed in 2006 and from the 1967 United States Air Force survey. This manuscript is also contained in Anthropometry as a vignette.
Robust constrained fuzzy clustering
It is well-known that outliers and noisy data can be very harmful when applying
clustering methods. Several fuzzy clustering methods which are able
to handle the presence of noise have been proposed. In this work, we propose
a robust clustering approach called F-TCLUST based on an “impartial”
(i.e., self-determined by data) trimming. The proposed approach considers
an eigenvalue ratio constraint that makes it a mathematically well-defined
problem and serves to control the allowed differences among cluster scatters.
A computationally feasible algorithm is proposed for its practical implementation.
Some guidelines about how to choose the parameters controlling the
performance of the fuzzy clustering procedure are also given.
Estadística e I