A Self-Supervised Approach for Cluster Assessment of High-Dimensional Data
Estimating the number of clusters and underlying cluster structure in a
dataset is a crucial task. Real-world data are often unlabeled, complex and
high-dimensional, which makes it difficult for traditional clustering
algorithms to perform well. In recent years, a matrix reordering based
algorithm, called "visual assessment of tendency" (VAT), and its variants have
attracted many researchers from various domains to estimate the number of
clusters and inherent cluster structure present in the data. However, these
algorithms fail when applied to high-dimensional data due to the curse of
dimensionality, as they rely heavily on the notions of closeness and farness
between data points. To address this issue, we propose a deep-learning-based
framework for cluster structure assessment in complex image datasets. First,
our framework generates representative embeddings for complex data using a
self-supervised deep neural network, and then, these low-dimensional embeddings
are fed to VAT/iVAT algorithms to estimate the underlying cluster structure. In
this process, we use no prior knowledge of the number of
clusters (i.e., k). We present our results on four real-life image datasets, and
our findings indicate that our framework outperforms state-of-the-art VAT/iVAT
algorithms in terms of clustering accuracy and normalized mutual information
(NMI). Comment: Submitted to IEEE SMC 202
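The reordering step that VAT applies to a dissimilarity matrix can be sketched as follows. This is a minimal illustration of the standard VAT ordering (a Prim-like nearest-neighbour sweep), not the authors' pipeline: the self-supervised network is not reproduced, and the `embeddings` array, along with the function names `pairwise_euclidean`, `vat_order` and `vat_image`, are assumptions introduced here for illustration.

```python
import numpy as np


def pairwise_euclidean(X):
    """Euclidean dissimilarity matrix for the rows of X (n x d)."""
    X = np.asarray(X, dtype=float)
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


def vat_order(D):
    """Return the VAT permutation of a symmetric dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity in D.
    start = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order = [start]
    remaining = np.ones(n, dtype=bool)
    remaining[start] = False
    # min_dist[j]: smallest dissimilarity between j and any ordered object.
    min_dist = D[start].copy()
    for _ in range(n - 1):
        candidates = np.where(remaining)[0]
        nxt = int(candidates[np.argmin(min_dist[candidates])])
        order.append(nxt)
        remaining[nxt] = False
        min_dist = np.minimum(min_dist, D[nxt])
    return np.array(order)


def vat_image(D):
    """Reorder D so that clusters appear as dark diagonal blocks."""
    order = vat_order(D)
    D = np.asarray(D, dtype=float)
    return D[np.ix_(order, order)], order


# Hypothetical low-dimensional embeddings (two interleaved groups).
embeddings = np.array([[0.0], [10.0], [0.1], [10.1], [0.2], [10.2]])
reordered, order = vat_image(pairwise_euclidean(embeddings))
print(order)
```

Displaying `reordered` as a grayscale image is what exposes the block structure; the number of dark diagonal blocks is then read as an estimate of k.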
Relational visual cluster validity
The assessment of cluster validity plays a very important role in cluster analysis. Most commonly used cluster validity methods are based on statistical hypothesis testing or on finding the best clustering scheme by computing a number of different cluster validity indices. A number of visual methods of cluster validity have been produced to display directly the validity of clusters by mapping data into two- or three-dimensional space. However, these methods may lose too much information to correctly estimate the results of clustering algorithms. Although the visual cluster validity (VCV) method of Hathaway and Bezdek can successfully solve this problem, it can only be applied to object data, i.e., feature measurements. There are very few validity methods that can be used to analyze the validity of data where only a similarity or dissimilarity relation exists (relational data). To tackle this problem, this paper presents a relational visual cluster validity (RVCV) method to assess the validity of clustering relational data. This is done by combining the results of the non-Euclidean relational fuzzy c-means (NERFCM) algorithm with a modification of the VCV method to produce a visual representation of cluster validity. RVCV can cluster complete and incomplete relational data and adds to the visual cluster validity theory. Numeric examples using synthetic and real data are presented.
An Efficient Visual Analysis Method for Cluster Tendency Evaluation, Data Partitioning and Internal Cluster Validation
Visual methods have been extensively studied and applied in cluster analysis. Given a pairwise dissimilarity matrix $D$ of a set of $n$ objects, visual methods such as the Enhanced-Visual Assessment Tendency (E-VAT) algorithm generally represent $D$ as an $n \times n$ image $I(\overline{D})$ where the objects are reordered to expose the hidden cluster structure as dark blocks along the diagonal of the image. A major constraint of such methods is their inability to highlight cluster structure when $D$ contains composite-shaped datasets. This paper addresses this limitation by proposing an enhanced visual analysis method for cluster tendency assessment, where $D$ is mapped to $D'$ by graph-based analysis and then reordered to $\overline{D}'$ using E-VAT, resulting in the graph-based Enhanced Visual Assessment Tendency (GE-VAT). An Enhanced Dark Block Extraction (E-DBE) for automatic determination of the number of clusters in $I(\overline{D}')$ is then proposed, as well as a visual data partitioning method for cluster formation from $I(\overline{D}')$ based on the disparity between diagonal and off-diagonal blocks using the permuted indices of GE-VAT. Cluster validation measures are also applied to evaluate the cluster formation. Extensive experimental results on several complex synthetic, UCI and large real-world data sets are analyzed to validate our algorithm.
Cluster Analysis of Open Research Data: A Case for Replication Metadata
Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets coming from numerous disciplines, from tabular data and program code to audio-visual files. Metadata, or data about data, is critical to making research outputs adequately documented and FAIR. Aiming to contribute to the discussions on the development of metadata for research outputs, I conducted an exploratory analysis to determine how research datasets cluster based on what researchers organically deposit together. I use the content of over 40,000 datasets from the Harvard Dataverse research data repository as my sample for the cluster analysis. I find that the majority of the clusters are formed by single-type datasets, while in the rest of the sample, no meaningful clusters can be identified. For the result interpretation, I use the metadata standard employed by DataCite, a leading organization for documenting a scholarly record, and map existing resource types to my results. About 65% of the sample can be described with single-type metadata (such as Dataset, Software or Report), while the rest would require aggregate metadata types. Though DataCite supports an aggregate type such as a Collection, I argue that a significant number of datasets, in particular those containing both data and code files (about 20% of the sample), would be more accurately described as a Replication resource metadata type. Such a resource type would be particularly useful in facilitating research reproducibility.
Clustering Data of Mixed Categorical and Numerical Type with Unsupervised Feature Learning
Mixed-type categorical and numerical data are a challenge in many applications. This general area of mixed-type data is among the frontier areas, where computational intelligence approaches are often brittle compared with the capabilities of living creatures. In this paper, unsupervised feature learning (UFL) is applied to mixed-type data to achieve a sparse representation, which makes it easier for clustering algorithms to separate the data. Unlike other UFL methods that work with homogeneous data, such as image and video data, the presented UFL works with mixed-type data using fuzzy adaptive resonance theory (ART). UFL with fuzzy ART (UFLA) obtains a better clustering result by removing the differences in treating categorical and numeric features. The advantages of doing this are demonstrated with several real-world data sets with ground truth, including heart disease, teaching assistant evaluation, and credit approval. The approach is also demonstrated on noisy, mixed-type petroleum industry data. UFLA is compared with several alternative methods. To the best of our knowledge, this is the first time UFL has been extended to accomplish the fusion of mixed data types.
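For readers unfamiliar with fuzzy ART, the basic clustering loop can be sketched as below. This is a minimal sketch of textbook fuzzy ART (complement coding, choice function, vigilance test, weight update), not the UFLA method of the abstract; the function names and the default values of `rho`, `alpha` and `beta` are assumptions chosen for illustration, and inputs are assumed to be scaled to [0, 1] beforehand.

```python
import numpy as np


def complement_code(x):
    """Complement coding: concatenate x with 1 - x (x must lie in [0, 1])."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, 1.0 - x])


def fuzzy_art(X, rho=0.75, alpha=1e-3, beta=1.0):
    """Cluster the rows of X with basic fuzzy ART.

    rho   : vigilance in [0, 1]; higher values give more, tighter categories
    alpha : small positive choice parameter
    beta  : learning rate (1.0 is fast learning)
    """
    weights = []  # one weight vector per category
    labels = []
    for x in X:
        I = complement_code(x)
        assigned = None
        if weights:
            W = np.array(weights)
            overlap = np.minimum(I, W).sum(axis=1)   # |I ^ w_j| (fuzzy AND)
            choice = overlap / (alpha + W.sum(axis=1))
            # Visit categories from highest to lowest choice value.
            for j in np.argsort(-choice):
                if overlap[j] / I.sum() >= rho:      # vigilance test
                    weights[j] = (beta * np.minimum(I, weights[j])
                                  + (1.0 - beta) * weights[j])
                    assigned = int(j)
                    break
        if assigned is None:
            # No category passed the vigilance test: create a new one.
            weights.append(I.copy())
            assigned = len(weights) - 1
        labels.append(assigned)
    return labels, np.array(weights)


# Toy data, already scaled to [0, 1].
toy = np.array([[0.10, 0.10], [0.12, 0.11], [0.90, 0.90], [0.88, 0.91]])
toy_labels, toy_weights = fuzzy_art(toy, rho=0.75)
print(toy_labels)
```

Because categories are created in presentation order, the result depends on the order of the rows of `X`; this is the ordering effect that ART-based clustering work often tries to mitigate.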
A Network Topology Approach to Bot Classification
Automated social agents, or bots, are increasingly becoming a problem on
social media platforms. There is a growing body of literature and multiple
tools to aid in the detection of such agents on online social networking
platforms. We propose that the social network topology of a user would be
sufficient to determine whether the user is an automated agent or a human. To
test this, we use a publicly available dataset containing users on Twitter
labelled as either automated social agent or human. Using an unsupervised
machine learning approach, we obtain a detection accuracy rate of 70%.
Neuroengineering of Clustering Algorithms
Cluster analysis can be broadly divided into multivariate data visualization, clustering algorithms, and cluster validation. This dissertation contributes neural network-based techniques to perform all three unsupervised learning tasks. Particularly, the first paper provides a comprehensive review on adaptive resonance theory (ART) models for engineering applications and provides context for the four subsequent papers. These papers are devoted to enhancements of ART-based clustering algorithms from (a) a practical perspective by exploiting the visual assessment of cluster tendency (VAT) sorting algorithm as a preprocessor for ART offline training, thus mitigating ordering effects; and (b) an engineering perspective by designing a family of multi-criteria ART models: dual vigilance fuzzy ART and distributed dual vigilance fuzzy ART (both of which are capable of detecting complex cluster structures), merge ART (aggregates partitions and lessens ordering effects in online learning), and cluster validity index vigilance in fuzzy ART (features a robust vigilance parameter selection and alleviates ordering effects in offline learning). The sixth paper consists of enhancements to data visualization using self-organizing maps (SOMs) by depicting in the reduced dimension and topology-preserving SOM grid information-theoretic similarity measures between neighboring neurons. This visualization's parameters are estimated using samples selected via a single-linkage procedure, thereby generating heatmaps that portray more homogeneous within-cluster similarities and crisper between-cluster boundaries. The seventh paper presents incremental cluster validity indices (iCVIs) realized by (a) incorporating existing formulations of online computations for clusters' descriptors, or (b) modifying an existing ART-based model and incrementally updating local density counts between prototypes.
Moreover, this last paper provides the first comprehensive comparison of iCVIs in the computational intelligence literature. --Abstract, page iv
Opaque Service Virtualisation: A Practical Tool for Emulating Endpoint Systems
Large enterprise software systems make many complex interactions with other
services in their environment. Developing and testing for production-like
conditions is therefore a very challenging task. Current approaches include
emulation of dependent services using either explicit modelling or
record-and-replay approaches. Models require deep knowledge of the target
services while record-and-replay is limited in accuracy. Both face
developmental and scaling issues. We present a new technique that improves the
accuracy of record-and-replay approaches, without requiring prior knowledge of
the service protocols. The approach uses Multiple Sequence Alignment to derive
message prototypes from recorded system interactions and a scheme to match
incoming request messages against prototypes to generate response messages. We
use a modified Needleman-Wunsch algorithm for distance calculation during
message matching. Our approach has shown greater than 99% accuracy for four
evaluated enterprise system messaging protocols. The approach has been
successfully integrated into the CA Service Virtualization commercial product
to complement its existing techniques. Comment: In Proceedings of the 38th International Conference on Software
Engineering Companion (pp. 202-211). arXiv admin note: text overlap with
arXiv:1510.0142
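The alignment-based matching described in this abstract can be illustrated with the standard Needleman-Wunsch algorithm. The sketch below is the textbook global alignment, not the modified variant the authors use, and it returns a similarity score rather than a distance; the scoring values, the `closest_prototype` helper and the example messages are assumptions introduced here for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of sequences a and b.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def closest_prototype(request, prototypes):
    """Hypothetical helper: pick the prototype with the highest alignment score."""
    return max(prototypes, key=lambda p: needleman_wunsch(request, p)[0])


# Made-up messages, for illustration only.
print(closest_prototype("GET /user/42", ["GET /user/1", "POST /order"]))
```

Working on raw characters means no knowledge of the message format is required, which is in the spirit of the opaque approach; the quadratic cost per comparison is the usual trade-off of this kind of alignment.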