16 research outputs found

    A Self-Supervised Approach for Cluster Assessment of High-Dimensional Data

    Estimating the number of clusters and the underlying cluster structure in a dataset is a crucial task. Real-world data are often unlabeled, complex, and high-dimensional, which makes it difficult for traditional clustering algorithms to perform well. In recent years, a matrix-reordering-based algorithm called "visual assessment of tendency" (VAT) and its variants have attracted researchers from various domains to estimate the number of clusters and the inherent cluster structure present in the data. However, these algorithms fail when applied to high-dimensional data due to the curse of dimensionality, as they rely heavily on the notions of closeness and farness between data points. To address this issue, we propose a deep-learning-based framework for cluster structure assessment in complex image datasets. First, our framework generates representative embeddings for complex data using a self-supervised deep neural network; these low-dimensional embeddings are then fed to VAT/iVAT algorithms to estimate the underlying cluster structure. In this process, we did not use any prior knowledge of the number of clusters (i.e., k). We present our results on four real-life image datasets, and our findings indicate that our framework outperforms state-of-the-art VAT/iVAT algorithms in terms of clustering accuracy and normalized mutual information (NMI).
    Comment: Submitted to IEEE SMC 202
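    The matrix reordering at the heart of VAT can be sketched as below. This is a minimal illustration of the commonly described VAT ordering (a Prim-like traversal of the dissimilarity matrix), not the framework from the abstract: the embedding network and iVAT are omitted, and the function names `vat_order` and `vat` are hypothetical.

```python
import numpy as np

def vat_order(D):
    """Return a VAT-style ordering of a symmetric dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Start from one endpoint of the largest pairwise dissimilarity.
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    selected = np.zeros(n, dtype=bool)
    selected[i] = True
    order = [int(i)]
    # min_dist[j]: smallest dissimilarity from j to any already ordered point.
    min_dist = D[i].copy()
    for _ in range(n - 1):
        candidates = np.where(~selected)[0]
        j = int(candidates[np.argmin(min_dist[candidates])])
        selected[j] = True
        order.append(j)
        min_dist = np.minimum(min_dist, D[j])
    return np.array(order)

def vat(D):
    """Return the reordered dissimilarity matrix and the ordering used."""
    D = np.asarray(D, dtype=float)
    order = vat_order(D)
    return D[np.ix_(order, order)], order
```

    Displaying the reordered matrix as a grayscale image typically shows dark blocks along the diagonal, one per apparent cluster; counting those blocks is how the number of clusters is read off visually.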

    Convergence of ADAM with Constant Step Size in Non-Convex Settings: A Simple Proof

    In neural network training, RMSProp and ADAM remain widely favoured optimization algorithms. One of the keys to their performance lies in selecting the correct step size, which can significantly influence their effectiveness: these algorithms' performance can vary considerably depending on the chosen step size. Additionally, questions about their theoretical convergence properties continue to be a subject of interest. In this paper, we theoretically analyze a constant step size version of ADAM in the non-convex setting. We show sufficient conditions on the step size to achieve almost sure asymptotic convergence of the gradients to zero with minimal assumptions. We also provide runtime bounds for deterministic ADAM to reach approximate criticality when working with smooth, non-convex functions.
    Comment: 9 pages including references and appendi
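    For orientation, a deterministic ADAM loop with a constant step size might look like the sketch below. It uses the widely published update rule with bias correction, which is an assumption: the exact variant analyzed in the paper may differ. The function names, the default hyperparameters, and the test function f(x) = x^2 / (1 + x^2) are illustrative choices, not taken from the abstract.

```python
import numpy as np

def adam_constant_step(grad, x0, step=0.01, beta1=0.9, beta2=0.999,
                       eps=1e-8, iters=2000):
    """Run deterministic (full-gradient) ADAM with a fixed step size."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # running average of gradients
    v = np.zeros_like(x)  # running average of squared gradients
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        x = x - step * m_hat / (np.sqrt(v_hat) + eps)
    return x

def grad_f(x):
    """Gradient of the smooth, non-convex function f(x) = x^2 / (1 + x^2)."""
    return 2 * x / (1 + x ** 2) ** 2

x0 = np.array([2.0])
x_final = adam_constant_step(grad_f, x0)
print(np.linalg.norm(grad_f(x0)), np.linalg.norm(grad_f(x_final)))
```

    Comparing the two printed gradient norms gives a quick empirical check of the "approximate criticality" notion the abstract refers to; it says nothing about the paper's actual bounds.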

    Learning Low-Rank Latent Spaces with Simple Deterministic Autoencoder: Theoretical and Empirical Insights

    The autoencoder is an unsupervised learning paradigm that aims to create a compact latent representation of data by minimizing the reconstruction loss. However, it tends to overlook the fact that most data (images) are embedded in a lower-dimensional space, which is crucial for effective data representation. To address this limitation, we propose a novel approach called Low-Rank Autoencoder (LoRAE). In LoRAE, we incorporate a low-rank regularizer to adaptively reconstruct a low-dimensional latent space while preserving the basic objective of an autoencoder. This helps embed the data in a lower-dimensional space while preserving important information. It is a simple autoencoder extension that learns a low-rank latent space. Theoretically, we establish a tighter error bound for our model. Empirically, our model's superiority shows across various tasks such as image generation and downstream classification. Both theoretical and practical outcomes highlight the importance of acquiring low-dimensional embeddings.
    Comment: Accepted @ IEEE/CVF WACV 202

    Big data cluster analysis and its applications

    © 2018 Dr. Punit Rathore. The increasing prevalence of Internet of Things (IoT) technologies, smartphones, and social media services generates a huge amount of data, popularly known as "big data". Extracting useful information from big data is essential for many businesses and applications to provide better services and increase their profits. For example, smart city solutions aim to use this wealth of data to formulate effective policies that solve the problems faced by citizens. These voluminous data are usually unlabeled; therefore, scalable and efficient unsupervised algorithms are required to manage and extract actionable information from big data. Cluster analysis is a useful unsupervised approach for discovering the underlying groups and useful patterns in the data. Cluster analysis for any data consists of three problems: (P1) cluster assessment, which asks "Do the data have clusters? If yes, how many?"; (P2) clustering, i.e., partitioning the data into clusters; and (P3) cluster validity, which asks "Are the clusters found useful? Is there a better one we did not find?" Traditional cluster analysis algorithms are not suitable for big data owing to its volume, variety, and velocity. This thesis developed a suite of novel scalable algorithms to solve each of the three problems of cluster analysis, namely cluster assessment, clustering, and cluster validity, for big data that may be high-dimensional, anomalous, and streaming. For demonstration, a novel scalable framework for predicting large-scale taxi trajectories is presented as a real application of big data clustering. Our first contribution addresses the high-dimensionality and scalability issues for soft clustering methods.
Specifically, we developed a simple and computationally efficient framework for high-dimensional data clustering, CAFCM, which employs fuzzy c-means clustering on an ensemble of random projections to obtain multiple fuzzy clustering partitions, and then cumulatively aggregates them based on their quality to get a final output partition. The CAFCM framework scales linearly in the number of samples in the data and does not require any prior knowledge of the number of clusters, which makes it an attractive clustering approach for big datasets. Our second contribution solves the cluster tendency assessment and clustering problem for voluminous, high-dimensional datasets. We developed a fast cluster tendency assessment and subsequent clustering algorithm, FensiVAT, which efficiently integrates an intelligent sampling scheme, called Maximin Random Sampling (MMRS), and a new random projection (RP)-based ensemble method with the visual assessment of cluster tendency (VAT) method. The reordered dissimilarity image (RDI), also known as a cluster heat map, obtained in FensiVAT suggests the number of clusters in the data. FensiVAT is more effective than the existing big data clustering techniques, in terms of both CPU time and cluster quality. Our third contribution deals with the cluster validity problem for big data. Notably, we presented six novel approximation algorithms, including two incremental methods, to compute Dunn's cluster validity index for big data. Four methods use variations of MMRS sampling and two are based on unsupervised training of one-class support vector machines. All six methods for estimating Dunn's index (DI) are linear in the number of samples. Computing approximations to DI with MMRS methods is both tractable and accurate. After dealing with big static data, our next contribution focused on detecting evolving structure in high-velocity, streaming data.
Existing VAT-based algorithms for streaming data, inc-VAT/inc-iVAT and dec-VAT/dec-iVAT, are impractical for high-velocity data streams. We developed a novel algorithm, inc-siVAT, for incremental and time-efficient visualization of evolving cluster structures in high-velocity data streams. inc-siVAT extracts an initial smart (MMRS) sample and its RDI, then incrementally updates them on the fly to track changes in cluster structure after each chunk. The new algorithm is demonstrated for visualizing evolving cluster structures and detecting anomalies in dynamic streams of four big datasets, including a real IoT dataset. Finally, we demonstrate our big data clustering framework for a real-life smart city application. Based on a big data clustering method and Markov models, we developed a scalable framework for vehicle trajectory prediction that is suitable for a large number of overlapping trajectories in a dense road network, typical of major cities around the world. The short-term and long-term prediction performance of our framework on two real-life, large-scale taxi trajectory datasets from the Beijing and Singapore road networks is found to be better than that of two current methods in terms of prediction accuracy and distance error.
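
    As a point of reference for the cluster validity contribution above, the exact (non-approximate) Dunn's index can be sketched as follows, using the standard definition: smallest between-cluster distance divided by largest within-cluster diameter. This brute-force version is quadratic in the number of samples, which is the cost the thesis's linear-time approximations are meant to avoid; it is not one of the six methods described. The function name `dunn_index` is hypothetical, and the sketch assumes Euclidean distance, at least two clusters, and at least one cluster containing two distinct points.

```python
import numpy as np

def dunn_index(X, labels):
    """Exact Dunn's index: min between-cluster distance / max cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix (O(n^2) memory and time).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    # Largest distance between two points of the same cluster.
    max_diameter = max(
        D[np.ix_(labels == c, labels == c)].max() for c in clusters
    )
    # Smallest distance between points of two different clusters.
    min_separation = min(
        D[np.ix_(labels == a, labels == b)].min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter
```

    Larger values indicate compact, well-separated clusters; the index is sensitive to outliers because it depends on single extreme distances.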

    Estimating generalized Dunn's cluster validity indices for big data
